Move the implementation of the kill intrinsics to the WQM pass. Add live lane
tracking by updating a stored copy of the exec mask whenever lanes are killed.
Use the tracked live lanes to enable early termination of the shader at any
point in control flow.
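For orientation, here is a minimal sketch (not the patch's actual code) of what the lowering described above amounts to, using the standard MachineInstr-building APIs. The helper name lowerKill and its signature are made up; SI_EARLY_TERMINATE_SCC0 is the early-exit pseudo discussed in the review below, and LiveMaskReg is assumed to hold a copy of $exec taken at the shader entry.

    #include "SIInstrInfo.h"                      // AMDGPU target internals
    #include "llvm/CodeGen/MachineInstrBuilder.h" // BuildMI
    using namespace llvm;

    // Lower a kill: update the stored live mask, allow early termination,
    // then actually deactivate the killed lanes. CondReg holds the "kill"
    // bits for the currently active lanes.
    static void lowerKill(MachineBasicBlock &MBB, MachineInstr &MI,
                          Register LiveMaskReg, Register CondReg,
                          const SIInstrInfo *TII) {
      const DebugLoc &DL = MI.getDebugLoc();
      // Clear the killed lanes from the live mask. S_ANDN2 sets SCC to
      // "result is non-zero", so SCC == 0 means no live lanes remain.
      BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_ANDN2_B64), LiveMaskReg)
          .addReg(LiveMaskReg)
          .addReg(CondReg);
      // Terminate the shader early if the whole wave is now dead.
      BuildMI(MBB, MI, DL, TII->get(AMDGPU::SI_EARLY_TERMINATE_SCC0));
      // Deactivate the killed lanes for the remainder of the shader.
      BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_ANDN2_B64), AMDGPU::EXEC)
          .addReg(AMDGPU::EXEC)
          .addReg(CondReg);
    }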
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll

Line 87: The generated code for this test (and a few others) is slightly unexpected (all three patches combined):

    Before:
        v_cmpx_lt_f32_e32 vcc, 0, v0

    After:
        v_cmp_gt_f32_e32 vcc, 0, v0
        s_andn2_b64 exec, exec, vcc
        s_andn2_b64 exec, exec, vcc
llvm/lib/Target/AMDGPU/SIInsertSkips.cpp

Lines 209–210: Seems like dead code, since OldSuccessor is never set to anything useful.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 15: Are you changing whether or not this pass can assume SSA form?

Lines 152–156: I think this was changed to a MapVector to give a stable iteration order, so changing it back to DenseMap seems dangerous.
llvm/lib/Target/AMDGPU/SIInsertSkips.cpp

Lines 209–210: This belongs in D94748.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 15: Yes: since the pass is now run after the scheduler, it drops support for SSA form.

Lines 152–156: It seems that change occurred during the development of this patch; I missed it, so failed to incorporate it.

Lines 621–629: This is a legacy of an older version that should have been deleted.

Line 869: Yep, this should be in D94747.

Line 1056: Move to D94746.
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll

Line 87: What is happening is that the mask update and the exec update use the same register, and the shader is marked GS. Post-WQM:

    // live mask generated:
    %3:sreg_64 = COPY $exec
    // kill:
    %0:vgpr_32 = COPY $vgpr0
    V_CMP_GT_F32_e32 0, %0:vgpr_32, implicit-def $vcc, implicit $mode, implicit $exec
    // live mask update:
    dead %3:sreg_64 = S_ANDN2_B64 %3:sreg_64, $vcc, implicit-def $scc
    SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc
    // kill implemented:
    $exec = S_ANDN2_B64 $exec, $vcc, implicit-def $scc

Here SI_EARLY_TERMINATE_SCC0 generates nothing because the test shader is marked amdgpu_gs. I think the test shaders are too trivial to be representative of real code generation; I added the sendmsg because otherwise these shaders optimise away to nothing. It could still be reasonable to add a peephole to clean these up.
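To illustrate the peephole idea, here is a sketch under the assumption that, once SI_EARLY_TERMINATE_SCC0 has been expanded to nothing (as in the amdgpu_gs case above), the redundant pair ends up adjacent. The helper name is made up; the only nontrivial LLVM API used is MachineInstr::isIdenticalTo.

    #include "SIInstrInfo.h"        // AMDGPU::S_ANDN2_B64
    #include "llvm/ADT/STLExtras.h" // make_early_inc_range
    using namespace llvm;

    // Erase an S_ANDN2_B64 that exactly repeats its predecessor:
    // (x & ~c) & ~c == x & ~c, so the second copy computes nothing new.
    static bool eraseRedundantAndn2(MachineBasicBlock &MBB) {
      bool Changed = false;
      for (MachineInstr &MI : make_early_inc_range(MBB)) {
        if (MI.getOpcode() != AMDGPU::S_ANDN2_B64 ||
            MI.getIterator() == MBB.begin())
          continue;
        // isIdenticalTo also checks the defs, so this only fires when both
        // instructions write the same register (e.g. both write $exec).
        if (std::prev(MI.getIterator())->isIdenticalTo(MI)) {
          MI.eraseFromParent();
          Changed = true;
        }
      }
      return Changed;
    }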
Remove VCC def flags from SI_KILL_I1 and add test.
This bug existed prior to this patch but was not causing any issues, as control flow lowering does not track liveness.
However, it matters for WQM lowering of kills, where it can lead to stray definitions of VCC.
LGTM with a few more nits (inline).
Being able to query live lanes at any point in the shader makes sense to me and I really like the removal of SI_KILL_CLEANUP.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 866: LiveMaskWQM unused.

Line 1066: Nit: I tend to agree with the clang-tidy warning. s/isEntry/IsEntry/ for consistency?

Lines 1357–1362: Is the comment up-to-date? Did you mean "does not need WQM nor WWM"?
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll

Line 87: Fair enough. Thanks for checking.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 213: I don't think you need to "require" this, since you don't explicitly use it for anything.

Line 642: Maybe just choose the new opcode inside the switch, and pull the get/setDesc calls out?

Line 689: I'm a bit sceptical of some of the opcodes in this table. E.g. SETOGT --(swap operands)--> SETOLT --(invert)--> SETUGE, which is V_CMP_NLT, but the table here says V_CMP_GT. It might be more obvious what's going on if you call getSetCCSwappedOperands and getSetCCInverse first, and then have a simple lookup from SETxx to opcode (see the sketch after this block). Or you could avoid the "invert" by using AND instead of ANDN2?

Line 704: It's a bit more conventional to lump SETNE together with SETUNE (unlike all the other comparisons, which go with the O form), though of course it doesn't really matter.

Lines 1075–1076: No need to add braces.
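To make the suggestion for line 689 concrete, here is a sketch of the normalize-then-look-up approach. getSetCCSwappedOperands and getSetCCInverse are the existing ISD helpers; killCmpOpcodeF32 and vcmpOpcodeForSetCC are hypothetical names, and the SwapOperands flag stands in for however the kill pseudo orders its operands.

    #include "llvm/CodeGen/ISDOpcodes.h" // ISD::CondCode and helpers
    #include "llvm/CodeGen/ValueTypes.h" // EVT/MVT
    using namespace llvm;

    // Hypothetical: a plain lookup from a SETcc to its V_CMP_*_F32 opcode.
    unsigned vcmpOpcodeForSetCC(ISD::CondCode CC);

    unsigned killCmpOpcodeF32(ISD::CondCode CC, bool SwapOperands) {
      if (SwapOperands)
        CC = ISD::getSetCCSwappedOperands(CC); // e.g. SETOGT -> SETOLT
      // The comparison emitted is the *inverse* of the keep condition: it
      // produces the lanes to kill, which are then cleared with ANDN2.
      CC = ISD::getSetCCInverse(CC, MVT::f32); // e.g. SETOLT -> SETUGE
      return vcmpOpcodeForSetCC(CC);           // e.g. SETUGE -> V_CMP_NLT
    }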
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 689: I agree these are wrong, but the inversion is required because V_CMP only generates comparison results for active lanes -- so in non-uniform control flow the condition is not a complete mask, and it cannot simply be ANDed into the live mask (see the toy illustration after this block).

Line 704: The branches of the switch are lifted from the old kill lowering code.
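A toy illustration of that point (plain C++, not LLVM code; the lane count and masks are made up): with lanes 2-3 inactive, ANDing the incomplete "keep" mask into the live mask kills them by accident, while ANDN2 with the inverted "kill" mask leaves them untouched.

    #include <cstdint>
    #include <cstdio>

    int main() {
      const uint8_t live = 0b1111; // live mask: all four lanes still live
      const uint8_t exec = 0b0011; // lanes 2-3 inactive in this branch
      // V_CMP writes result bits only for lanes active in exec; inactive
      // lanes read as 0. Say the keep condition holds for lane 1 only:
      const uint8_t keep = 0b0010;            // lanes 2-3 forced to 0
      const uint8_t kill = ~keep & exec;      // inverse, active lanes only
      const uint8_t withAnd = live & keep;    // 0b0010: lanes 2-3 killed!
      const uint8_t withAndn2 = live & ~kill; // 0b1110: lanes 2-3 survive
      printf("AND:   %#x (wrong)\n", (unsigned)withAnd);
      printf("ANDN2: %#x (right)\n", (unsigned)withAndn2);
      return 0;
    }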
- Address review comments.
- Fix conversion table for comparisons in F32 kills.
- Fix a bug where conditional kills could terminate WQM.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Lines 726–730: This looks suspicious: both SETUGT and SETUGE map to V_CMP_NGE.
- Rework F32 kill condition code translation as it was still wrong.
- There is still an outstanding bug in WWM somewhere.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 692: I still don't trust this table! I think the "O" predicates should map to "N" opcodes, e.g. SETOLT -> V_CMP_NLT, as explained above.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 692: Are you expecting the table to look like this?

    case ISD::SETEQ:  Opcode = AMDGPU::V_CMP_LG_F32_e64; break;
    case ISD::SETGT:  Opcode = AMDGPU::V_CMP_GE_F32_e64; break;
    case ISD::SETGE:  Opcode = AMDGPU::V_CMP_GT_F32_e64; break;
    case ISD::SETLT:  Opcode = AMDGPU::V_CMP_LE_F32_e64; break;
    case ISD::SETLE:  Opcode = AMDGPU::V_CMP_LT_F32_e64; break;
    case ISD::SETNE:  Opcode = AMDGPU::V_CMP_EQ_F32_e64; break;
    case ISD::SETO:   Opcode = AMDGPU::V_CMP_O_F32_e64; break;
    case ISD::SETUO:  Opcode = AMDGPU::V_CMP_U_F32_e64; break;
    case ISD::SETOEQ:
    case ISD::SETUEQ: Opcode = AMDGPU::V_CMP_NEQ_F32_e64; break;
    case ISD::SETOGT:
    case ISD::SETUGT: Opcode = AMDGPU::V_CMP_NLT_F32_e64; break;
    case ISD::SETOGE:
    case ISD::SETUGE: Opcode = AMDGPU::V_CMP_NLE_F32_e64; break;
    case ISD::SETOLT:
    case ISD::SETULT: Opcode = AMDGPU::V_CMP_NGT_F32_e64; break;
    case ISD::SETOLE:
    case ISD::SETULE: Opcode = AMDGPU::V_CMP_NGE_F32_e64; break;
    case ISD::SETONE:
    case ISD::SETUNE: Opcode = AMDGPU::V_CMP_NLG_F32_e64; break;

I am still testing, but I am unsure whether VulkanCTS or game shaders can tell the difference.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 692: No, I'm expecting it to look like this:

    case ISD::SETOEQ:
    case ISD::SETEQ:  Opcode = AMDGPU::V_CMP_NEQ_F32_e64; break;
    case ISD::SETOGT:
    case ISD::SETGT:  Opcode = AMDGPU::V_CMP_NLT_F32_e64; break;
    case ISD::SETOGE:
    case ISD::SETGE:  Opcode = AMDGPU::V_CMP_NLE_F32_e64; break;
    case ISD::SETOLT:
    case ISD::SETLT:  Opcode = AMDGPU::V_CMP_NGT_F32_e64; break;
    case ISD::SETOLE:
    case ISD::SETLE:  Opcode = AMDGPU::V_CMP_NGE_F32_e64; break;
    case ISD::SETONE: Opcode = AMDGPU::V_CMP_NLG_F32_e64; break;
    case ISD::SETO:   Opcode = AMDGPU::V_CMP_U_F32_e64; break;
    case ISD::SETUO:  Opcode = AMDGPU::V_CMP_O_F32_e64; break;
    case ISD::SETUEQ: Opcode = AMDGPU::V_CMP_LG_F32_e64; break;
    case ISD::SETUGT: Opcode = AMDGPU::V_CMP_GE_F32_e64; break;
    case ISD::SETUGE: Opcode = AMDGPU::V_CMP_GT_F32_e64; break;
    case ISD::SETULT: Opcode = AMDGPU::V_CMP_LE_F32_e64; break;
    case ISD::SETULE: Opcode = AMDGPU::V_CMP_LT_F32_e64; break;
    case ISD::SETUNE:
    case ISD::SETNE:  Opcode = AMDGPU::V_CMP_EQ_F32_e64; break;

E.g. SETOGT --(swap operands)--> SETOLT --(invert)--> SETUGE, which is V_CMP_NLT. (Though as mentioned previously, SETEQ is undefined on NaNs, so it doesn't matter whether it behaves the same as SETOEQ or the same as SETUEQ; likewise for all the others that don't have an explicit O or U.)
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Line 692: Thanks -- I have now validated this version as well.