Diff 220868

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

	Show First 20 Lines • Show All 1,233 Lines • ▼ Show 20 Lines

	// Pixel shaders only: whether the current pixel is live (i.e. not a helper			// Pixel shaders only: whether the current pixel is live (i.e. not a helper
	// invocation for derivative computation).			// invocation for derivative computation).
	def int_amdgcn_ps_live : Intrinsic <			def int_amdgcn_ps_live : Intrinsic <
	[llvm_i1_ty],			[llvm_i1_ty],
	[],			[],
	[IntrNoMem]>;			[IntrNoMem]>;

				// Like ps.live, but cannot be moved by LICM.
				def int_amdgcn_wqm_helper : Intrinsic <[llvm_i1_ty], [], []>;
				arsenm: I assume this needs to be convergent.
				critson (author): Could be. My understanding is that without flags the intrinsic is marked "has side effects", which is correct, as it will then not be moved by LICM or removed by CSE.
				nhaehnle: Rethinking this, convergent isn't correct here, because there is no implied cross-thread communication. Rather, the semantics are that amdgcn.wqm.helper reads from some hidden memory that is written to by wqm.demote. So by that logic, this should arguably be ReadInaccessibleMemOnly.

	def int_amdgcn_mbcnt_lo :			def int_amdgcn_mbcnt_lo :
	GCCBuiltin<"__builtin_amdgcn_mbcnt_lo">,			GCCBuiltin<"__builtin_amdgcn_mbcnt_lo">,
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [IntrNoMem]>;			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [IntrNoMem]>;

	def int_amdgcn_mbcnt_hi :			def int_amdgcn_mbcnt_hi :
	GCCBuiltin<"__builtin_amdgcn_mbcnt_hi">,			GCCBuiltin<"__builtin_amdgcn_mbcnt_hi">,
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [IntrNoMem]>;			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [IntrNoMem]>;

	▲ Show 20 Lines • Show All 196 Lines • ▼ Show 20 Lines
	// the function.			// the function.
	def int_amdgcn_wqm_vote : Intrinsic<[llvm_i1_ty],			def int_amdgcn_wqm_vote : Intrinsic<[llvm_i1_ty],
	[llvm_i1_ty], [IntrNoMem, IntrConvergent]			[llvm_i1_ty], [IntrNoMem, IntrConvergent]
	>;			>;

	// If false, set EXEC=0 for the current thread until the end of program.			// If false, set EXEC=0 for the current thread until the end of program.
	def int_amdgcn_kill : Intrinsic<[], [llvm_i1_ty], []>;			def int_amdgcn_kill : Intrinsic<[], [llvm_i1_ty], []>;

				// If false, mark all active lanes as helper lanes until the end of program.
				def int_amdgcn_wqm_demote : Intrinsic<[], [llvm_i1_ty], []>;
				arsenm: Ditto, and nomem.
				critson (author): Convergent maybe (as above), but not nomem. If this is marked nomem then it will be eaten by early CSE. Since this was modelled on kill, is there a reason we don't mark kill Convergent?
				nhaehnle: Following the logic above, this should not be convergent but WritesInaccessibleMemOnly. At least I think that captures the semantics most accurately.
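The "hidden memory" argument in this thread can be illustrated with a small standalone model. This is only a sketch, not code from the patch: `WaveModel`, `HelperReads` and `readAroundDemote` are hypothetical names, lanes are modelled as bits of a 64-bit mask, and it assumes that amdgcn.wqm.helper yields true for helper lanes (the diff only says it is "like ps.live").

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (not from the patch): one bit per lane of a wave.
struct WaveModel {
  uint64_t Exec; // lanes currently executing
  uint64_t Live; // hidden state: lanes that are not helper lanes

  // Models wqm.demote: active lanes whose condition bit is clear
  // become helper lanes. This is a write to the hidden state.
  void demote(uint64_t Cond) { Live &= ~(Exec & ~Cond); }

  // Models wqm.helper, ASSUMING it returns true for helper lanes.
  // This is a read of the hidden state.
  uint64_t helper() const { return Exec & ~Live; }
};

// Results of reading helper() before and after a demote.
struct HelperReads {
  uint64_t Before;
  uint64_t After;
};

// Shows that the two reads can differ, so a read may not be hoisted
// above (or merged across) an intervening demote.
HelperReads readAroundDemote(WaveModel W, uint64_t Cond) {
  HelperReads R;
  R.Before = W.helper();
  W.demote(Cond);
  R.After = W.helper();
  return R;
}
```

In this model the read after the demote differs from the read before it. That is the dependency which modelling the pair as a reader and a writer of inaccessible memory would express, without implying any cross-lane communication.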

	// Copies the active channels of the source value to the destination value,			// Copies the active channels of the source value to the destination value,
	// with the guarantee that the source value is computed as if the entire			// with the guarantee that the source value is computed as if the entire
	// program were executed in Whole Wavefront Mode, i.e. with all channels			// program were executed in Whole Wavefront Mode, i.e. with all channels
	// enabled, with a few exceptions: - Phi nodes with require WWM return an			// enabled, with a few exceptions: - Phi nodes with require WWM return an
	// undefined value.			// undefined value.
	def int_amdgcn_wwm : Intrinsic<[llvm_any_ty],			def int_amdgcn_wwm : Intrinsic<[llvm_any_ty],
	[LLVMMatchType<0>], [IntrNoMem, IntrSpeculatable, IntrConvergent]			[LLVMMatchType<0>], [IntrNoMem, IntrSpeculatable, IntrConvergent]
	>;			>;
	▲ Show 20 Lines • Show All 374 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td

	Show First 20 Lines • Show All 92 Lines • ▼ Show 20 Lines
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_umax>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_umax>;
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_and>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_and>;
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_or>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_or>;
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_xor>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_xor>;
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_inc>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_inc>;
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_dec>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_dec>;
	def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_cmpswap>;			def : SourceOfDivergence<int_amdgcn_struct_buffer_atomic_cmpswap>;
	def : SourceOfDivergence<int_amdgcn_ps_live>;			def : SourceOfDivergence<int_amdgcn_ps_live>;
				def : SourceOfDivergence<int_amdgcn_wqm_helper>;
				arsenm: Missing DivergenceAnalysis test.
	def : SourceOfDivergence<int_amdgcn_ds_swizzle>;			def : SourceOfDivergence<int_amdgcn_ds_swizzle>;
	def : SourceOfDivergence<int_amdgcn_ds_ordered_add>;			def : SourceOfDivergence<int_amdgcn_ds_ordered_add>;
	def : SourceOfDivergence<int_amdgcn_ds_ordered_swap>;			def : SourceOfDivergence<int_amdgcn_ds_ordered_swap>;
	def : SourceOfDivergence<int_amdgcn_permlane16>;			def : SourceOfDivergence<int_amdgcn_permlane16>;
	def : SourceOfDivergence<int_amdgcn_permlanex16>;			def : SourceOfDivergence<int_amdgcn_permlanex16>;
	def : SourceOfDivergence<int_amdgcn_mov_dpp>;			def : SourceOfDivergence<int_amdgcn_mov_dpp>;
	def : SourceOfDivergence<int_amdgcn_mov_dpp8>;			def : SourceOfDivergence<int_amdgcn_mov_dpp8>;
	def : SourceOfDivergence<int_amdgcn_update_dpp>;			def : SourceOfDivergence<int_amdgcn_update_dpp>;
	Show All 25 Lines

llvm/lib/Target/AMDGPU/SIInsertSkips.cpp

Show First 20 Lines • Show All 264 Lines • ▼ Show 20 Lines	if (TRI->isVGPR(MBB.getParent()->getRegInfo(),
.add(MI.getOperand(1))		.add(MI.getOperand(1))
.addImm(0) // src1 modifiers		.addImm(0) // src1 modifiers
.add(MI.getOperand(0));		.add(MI.getOperand(0));

I.addImm(0); // omod		I.addImm(0); // omod
}		}
break;		break;
}		}
		case AMDGPU::SI_DEMOTE_I1_TERMINATOR:
case AMDGPU::SI_KILL_I1_TERMINATOR: {		case AMDGPU::SI_KILL_I1_TERMINATOR: {
const MachineFunction *MF = MI.getParent()->getParent();		const MachineFunction *MF = MI.getParent()->getParent();
const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();		const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();
unsigned Exec = ST.isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;		unsigned Exec = ST.isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
const MachineOperand &Op = MI.getOperand(0);		const MachineOperand &Op = MI.getOperand(0);
int64_t KillVal = MI.getOperand(1).getImm();		int64_t KillVal = MI.getOperand(1).getImm();
assert(KillVal == 0 || KillVal == -1);		assert(KillVal == 0 || KillVal == -1);

▲ Show 20 Lines • Show All 200 Lines • ▼ Show 20 Lines	for (I = MBB.begin(); I != MBB.end(); I = Next) {
// inserted after the current one and let skip the two instructions		// inserted after the current one and let skip the two instructions
// performing the kill if the exec mask is non-zero.		// performing the kill if the exec mask is non-zero.
MI.eraseFromParent();		MI.eraseFromParent();
}		}
break;		break;

case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:		case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
case AMDGPU::SI_KILL_I1_TERMINATOR:		case AMDGPU::SI_KILL_I1_TERMINATOR:
		case AMDGPU::SI_DEMOTE_I1_TERMINATOR:
MadeChange = true;		MadeChange = true;
kill(MI);		kill(MI);

if (ExecBranchStack.empty()) {		if (ExecBranchStack.empty() &&
		MI.getOpcode() != AMDGPU::SI_DEMOTE_I1_TERMINATOR) {
if (NextBB != BE && skipIfDead(MI, *NextBB)) {		if (NextBB != BE && skipIfDead(MI, *NextBB)) {
HaveSkipBlock = true;		HaveSkipBlock = true;
NextBB = std::next(BI);		NextBB = std::next(BI);
BE = MF.end();		BE = MF.end();
}		}
} else {		} else {
HaveKill = true;		HaveKill = true;
}		}
Show All 38 Lines

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

Show First 20 Lines • Show All 1,420 Lines • ▼ Show 20 Lines	case AMDGPU::S_ANDN2_B64_term:
break;		break;

case AMDGPU::S_ANDN2_B32_term:		case AMDGPU::S_ANDN2_B32_term:
// This is only a terminator to get the correct spill code placement during		// This is only a terminator to get the correct spill code placement during
// register allocation.		// register allocation.
MI.setDesc(get(AMDGPU::S_ANDN2_B32));		MI.setDesc(get(AMDGPU::S_ANDN2_B32));
break;		break;

		case AMDGPU::S_AND_B64_term:
		// This is only a terminator to get the correct spill code placement during
		// register allocation.
		MI.setDesc(get(AMDGPU::S_AND_B64));
		break;

		case AMDGPU::S_AND_B32_term:
		// This is only a terminator to get the correct spill code placement during
		// register allocation.
		MI.setDesc(get(AMDGPU::S_AND_B32));
		break;

case AMDGPU::V_MOV_B64_PSEUDO: {		case AMDGPU::V_MOV_B64_PSEUDO: {
Register Dst = MI.getOperand(0).getReg();		Register Dst = MI.getOperand(0).getReg();
Register DstLo = RI.getSubReg(Dst, AMDGPU::sub0);		Register DstLo = RI.getSubReg(Dst, AMDGPU::sub0);
Register DstHi = RI.getSubReg(Dst, AMDGPU::sub1);		Register DstHi = RI.getSubReg(Dst, AMDGPU::sub1);

const MachineOperand &SrcOp = MI.getOperand(1);		const MachineOperand &SrcOp = MI.getOperand(1);
// FIXME: Will this work for 64-bit floating point immediates?		// FIXME: Will this work for 64-bit floating point immediates?
assert(!SrcOp.isFPImm());		assert(!SrcOp.isFPImm());
▲ Show 20 Lines • Show All 465 Lines • ▼ Show 20 Lines	bool SIInstrInfo::analyzeBranch(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
// exec management.		// exec management.
while (I != E && !I->isBranch() && !I->isReturn() &&		while (I != E && !I->isBranch() && !I->isReturn() &&
I->getOpcode() != AMDGPU::SI_MASK_BRANCH) {		I->getOpcode() != AMDGPU::SI_MASK_BRANCH) {
switch (I->getOpcode()) {		switch (I->getOpcode()) {
case AMDGPU::SI_MASK_BRANCH:		case AMDGPU::SI_MASK_BRANCH:
case AMDGPU::S_MOV_B64_term:		case AMDGPU::S_MOV_B64_term:
case AMDGPU::S_XOR_B64_term:		case AMDGPU::S_XOR_B64_term:
case AMDGPU::S_ANDN2_B64_term:		case AMDGPU::S_ANDN2_B64_term:
		case AMDGPU::S_AND_B64_term:
case AMDGPU::S_MOV_B32_term:		case AMDGPU::S_MOV_B32_term:
case AMDGPU::S_XOR_B32_term:		case AMDGPU::S_XOR_B32_term:
case AMDGPU::S_OR_B32_term:		case AMDGPU::S_OR_B32_term:
case AMDGPU::S_ANDN2_B32_term:		case AMDGPU::S_ANDN2_B32_term:
		case AMDGPU::S_AND_B32_term:
break;		break;
case AMDGPU::SI_IF:		case AMDGPU::SI_IF:
case AMDGPU::SI_ELSE:		case AMDGPU::SI_ELSE:
case AMDGPU::SI_KILL_I1_TERMINATOR:		case AMDGPU::SI_KILL_I1_TERMINATOR:
case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:		case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
		case AMDGPU::SI_DEMOTE_I1_TERMINATOR:
// FIXME: It's messy that these need to be considered here at all.		// FIXME: It's messy that these need to be considered here at all.
return true;		return true;
default:		default:
llvm_unreachable("unexpected non-branch terminator inst");		llvm_unreachable("unexpected non-branch terminator inst");
}		}

++I;		++I;
}		}
▲ Show 20 Lines • Show All 4,193 Lines • ▼ Show 20 Lines	MachineInstrBuilder SIInstrInfo::getAddNoCarry(MachineBasicBlock &MBB,
return BuildMI(MBB, I, DL, get(AMDGPU::V_ADD_I32_e64), DestReg)		return BuildMI(MBB, I, DL, get(AMDGPU::V_ADD_I32_e64), DestReg)
.addReg(UnusedCarry, RegState::Define | RegState::Dead);		.addReg(UnusedCarry, RegState::Define | RegState::Dead);
}		}

bool SIInstrInfo::isKillTerminator(unsigned Opcode) {		bool SIInstrInfo::isKillTerminator(unsigned Opcode) {
switch (Opcode) {		switch (Opcode) {
case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:		case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
case AMDGPU::SI_KILL_I1_TERMINATOR:		case AMDGPU::SI_KILL_I1_TERMINATOR:
		case AMDGPU::SI_DEMOTE_I1_TERMINATOR:
return true;		return true;
default:		default:
return false;		return false;
}		}
}		}

const MCInstrDesc &SIInstrInfo::getKillTerminatorFromPseudo(unsigned Opcode) const {		const MCInstrDesc &SIInstrInfo::getKillTerminatorFromPseudo(unsigned Opcode) const {
switch (Opcode) {		switch (Opcode) {
▲ Show 20 Lines • Show All 314 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/SIInstructions.td

Show First 20 Lines • Show All 189 Lines • ▼ Show 20 Lines	class WrapTerminatorInst<SOP_Pseudo base_inst> : SPseudoInstSI<
let UseNamedOperandTable = base_inst.UseNamedOperandTable;		let UseNamedOperandTable = base_inst.UseNamedOperandTable;
let CodeSize = base_inst.CodeSize;		let CodeSize = base_inst.CodeSize;
}		}

let WaveSizePredicate = isWave64 in {		let WaveSizePredicate = isWave64 in {
def S_MOV_B64_term : WrapTerminatorInst<S_MOV_B64>;		def S_MOV_B64_term : WrapTerminatorInst<S_MOV_B64>;
def S_XOR_B64_term : WrapTerminatorInst<S_XOR_B64>;		def S_XOR_B64_term : WrapTerminatorInst<S_XOR_B64>;
def S_ANDN2_B64_term : WrapTerminatorInst<S_ANDN2_B64>;		def S_ANDN2_B64_term : WrapTerminatorInst<S_ANDN2_B64>;
		def S_AND_B64_term : WrapTerminatorInst<S_AND_B64>;
}		}

let WaveSizePredicate = isWave32 in {		let WaveSizePredicate = isWave32 in {
def S_MOV_B32_term : WrapTerminatorInst<S_MOV_B32>;		def S_MOV_B32_term : WrapTerminatorInst<S_MOV_B32>;
def S_XOR_B32_term : WrapTerminatorInst<S_XOR_B32>;		def S_XOR_B32_term : WrapTerminatorInst<S_XOR_B32>;
def S_OR_B32_term : WrapTerminatorInst<S_OR_B32>;		def S_OR_B32_term : WrapTerminatorInst<S_OR_B32>;
def S_ANDN2_B32_term : WrapTerminatorInst<S_ANDN2_B32>;		def S_ANDN2_B32_term : WrapTerminatorInst<S_ANDN2_B32>;
		def S_AND_B32_term : WrapTerminatorInst<S_AND_B32>;
}		}

def WAVE_BARRIER : SPseudoInstSI<(outs), (ins),		def WAVE_BARRIER : SPseudoInstSI<(outs), (ins),
[(int_amdgcn_wave_barrier)]> {		[(int_amdgcn_wave_barrier)]> {
let SchedRW = [];		let SchedRW = [];
let hasNoSchedulingInfo = 1;		let hasNoSchedulingInfo = 1;
let hasSideEffects = 1;		let hasSideEffects = 1;
let mayLoad = 1;		let mayLoad = 1;
▲ Show 20 Lines • Show All 111 Lines • ▼ Show 20 Lines
}		}

def SI_PS_LIVE : PseudoInstSI <		def SI_PS_LIVE : PseudoInstSI <
(outs SReg_1:$dst), (ins),		(outs SReg_1:$dst), (ins),
[(set i1:$dst, (int_amdgcn_ps_live))]> {		[(set i1:$dst, (int_amdgcn_ps_live))]> {
let SALU = 1;		let SALU = 1;
}		}

		let Uses = [EXEC] in {
		def SI_WQM_HELPER : PseudoInstSI <
		(outs SReg_1:$dst), (ins),
		[(set i1:$dst, (int_amdgcn_wqm_helper))]> {
		let SALU = 1;
		}

		let Defs = [EXEC] in {
		def SI_DEMOTE_I1 : SPseudoInstSI <(outs), (ins SCSrc_i1:$src, i1imm:$killvalue)> {
		}
		def SI_DEMOTE_I1_TERMINATOR : SPseudoInstSI <(outs), (ins SCSrc_i1:$src, i1imm:$killvalue)> {
		foad: EXEC,SCC
		critson (author): Thanks!
		let isTerminator = 1;
		}
		} // End Defs = [EXEC]

		} // End Uses = [EXEC]

def SI_MASKED_UNREACHABLE : SPseudoInstSI <(outs), (ins),		def SI_MASKED_UNREACHABLE : SPseudoInstSI <(outs), (ins),
[(int_amdgcn_unreachable)],		[(int_amdgcn_unreachable)],
"; divergent unreachable"> {		"; divergent unreachable"> {
let Size = 0;		let Size = 0;
let hasNoSchedulingInfo = 1;		let hasNoSchedulingInfo = 1;
let FixedSize = 1;		let FixedSize = 1;
}		}

▲ Show 20 Lines • Show All 308 Lines • ▼ Show 20 Lines
>;		>;

def : Pat <		def : Pat <
(int_amdgcn_kill (i1 (not i1:$src))),		(int_amdgcn_kill (i1 (not i1:$src))),
(SI_KILL_I1_PSEUDO $src, -1)		(SI_KILL_I1_PSEUDO $src, -1)
>;		>;

def : Pat <		def : Pat <
		(int_amdgcn_wqm_demote i1:$src),
		(SI_DEMOTE_I1 $src, 0)
		>;

		def : Pat <
		(int_amdgcn_wqm_demote (i1 (not i1:$src))),
		(SI_DEMOTE_I1 $src, -1)
		>;

		def : Pat <
(AMDGPUkill i32:$src),		(AMDGPUkill i32:$src),
(SI_KILL_F32_COND_IMM_PSEUDO $src, 0, 3) // 3 means SETOGE		(SI_KILL_F32_COND_IMM_PSEUDO $src, 0, 3) // 3 means SETOGE
>;		>;

def : Pat <		def : Pat <
(int_amdgcn_kill (i1 (setcc f32:$src, InlineFPImm<f32>:$imm, cond:$cond))),		(int_amdgcn_kill (i1 (setcc f32:$src, InlineFPImm<f32>:$imm, cond:$cond))),
(SI_KILL_F32_COND_IMM_PSEUDO $src, (bitcast_fpimm_to_i32 $imm), (cond_as_i32imm $cond))		(SI_KILL_F32_COND_IMM_PSEUDO $src, (bitcast_fpimm_to_i32 $imm), (cond_as_i32imm $cond))
>;		>;
▲ Show 20 Lines • Show All 1,329 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp

Show First 20 Lines • Show All 208 Lines • ▼ Show 20 Lines	case AMDGPU::S_ANDN2_B64_term: {
return true;		return true;
}		}
case AMDGPU::S_ANDN2_B32_term: {		case AMDGPU::S_ANDN2_B32_term: {
// This is only a terminator to get the correct spill code placement during		// This is only a terminator to get the correct spill code placement during
// register allocation.		// register allocation.
MI.setDesc(TII.get(AMDGPU::S_ANDN2_B32));		MI.setDesc(TII.get(AMDGPU::S_ANDN2_B32));
return true;		return true;
}		}
		case AMDGPU::S_AND_B64_term: {
		// This is only a terminator to get the correct spill code placement during
		// register allocation.
		MI.setDesc(TII.get(AMDGPU::S_AND_B64));
		return true;
		}
		case AMDGPU::S_AND_B32_term: {
		// This is only a terminator to get the correct spill code placement during
		// register allocation.
		MI.setDesc(TII.get(AMDGPU::S_AND_B32));
		return true;
		}
default:		default:
return false;		return false;
}		}
}		}

static MachineBasicBlock::reverse_iterator fixTerminators(		static MachineBasicBlock::reverse_iterator fixTerminators(
const SIInstrInfo &TII,		const SIInstrInfo &TII,
MachineBasicBlock &MBB) {		MachineBasicBlock &MBB) {
▲ Show 20 Lines • Show All 199 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

Show First 20 Lines • Show All 126 Lines • ▼ Show 20 Lines	struct InstrInfo {
char Disabled = 0;		char Disabled = 0;
char OutNeeds = 0;		char OutNeeds = 0;
};		};

struct BlockInfo {		struct BlockInfo {
char Needs = 0;		char Needs = 0;
char InNeeds = 0;		char InNeeds = 0;
char OutNeeds = 0;		char OutNeeds = 0;
		char InitialState = 0;
		unsigned LiveMaskIn = 0; // Initial live mask register
		unsigned LiveMaskOut = 0; // Outgoing live mask register
};		};

struct WorkItem {		struct WorkItem {
MachineBasicBlock *MBB = nullptr;		MachineBasicBlock *MBB = nullptr;
MachineInstr *MI = nullptr;		MachineInstr *MI = nullptr;

WorkItem() = default;		WorkItem() = default;
WorkItem(MachineBasicBlock *MBB) : MBB(MBB) {}		WorkItem(MachineBasicBlock *MBB) : MBB(MBB) {}
WorkItem(MachineInstr *MI) : MI(MI) {}		WorkItem(MachineInstr *MI) : MI(MI) {}
};		};

class SIWholeQuadMode : public MachineFunctionPass {		class SIWholeQuadMode : public MachineFunctionPass {
private:		private:
CallingConv::ID CallingConv;		CallingConv::ID CallingConv;
const SIInstrInfo *TII;		const SIInstrInfo *TII;
const SIRegisterInfo *TRI;		const SIRegisterInfo *TRI;
const GCNSubtarget *ST;		const GCNSubtarget *ST;
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;
LiveIntervals *LIS;		LiveIntervals *LIS;

DenseMap<const MachineInstr *, InstrInfo> Instructions;		DenseMap<const MachineInstr *, InstrInfo> Instructions;
DenseMap<MachineBasicBlock *, BlockInfo> Blocks;		DenseMap<MachineBasicBlock *, BlockInfo> Blocks;
SmallVector<MachineInstr *, 1> LiveMaskQueries;
		// Tracks live mask output of instructions
		DenseMap<const MachineInstr *, unsigned> LiveMaskRegs;
		// Tracks state (WQM/WWM/Exact) after a given instruction
		DenseMap<const MachineInstr *, char> StateTransition;

		SmallVector<MachineInstr *, 2> LiveMaskQueries;
SmallVector<MachineInstr *, 4> LowerToCopyInstrs;		SmallVector<MachineInstr *, 4> LowerToCopyInstrs;
		SmallVector<MachineInstr *, 4> DemoteInstrs;

void printInfo();		void printInfo();

void markInstruction(MachineInstr &MI, char Flag,		void markInstruction(MachineInstr &MI, char Flag,
std::vector<WorkItem> &Worklist);		std::vector<WorkItem> &Worklist);
void markInstructionUses(const MachineInstr &MI, char Flag,		void markInstructionUses(const MachineInstr &MI, char Flag,
std::vector<WorkItem> &Worklist);		std::vector<WorkItem> &Worklist);
char scanInstructions(MachineFunction &MF, std::vector<WorkItem> &Worklist);		char scanInstructions(MachineFunction &MF, std::vector<WorkItem> &Worklist);
void propagateInstruction(MachineInstr &MI, std::vector<WorkItem> &Worklist);		void propagateInstruction(MachineInstr &MI, std::vector<WorkItem> &Worklist);
void propagateBlock(MachineBasicBlock &MBB, std::vector<WorkItem> &Worklist);		void propagateBlock(MachineBasicBlock &MBB, std::vector<WorkItem> &Worklist);
char analyzeFunction(MachineFunction &MF);		char analyzeFunction(MachineFunction &MF);

		void scanLiveLanes(MachineBasicBlock &MBB,
		std::vector<MachineBasicBlock *> &Worklist);
		void analyzeLiveLanes(MachineFunction &MF);

bool requiresCorrectState(const MachineInstr &MI) const;		bool requiresCorrectState(const MachineInstr &MI) const;

MachineBasicBlock::iterator saveSCC(MachineBasicBlock &MBB,		MachineBasicBlock::iterator saveSCC(MachineBasicBlock &MBB,
MachineBasicBlock::iterator Before);		MachineBasicBlock::iterator Before);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
prepareInsertion(MachineBasicBlock &MBB, MachineBasicBlock::iterator First,		prepareInsertion(MachineBasicBlock &MBB, MachineBasicBlock::iterator First,
MachineBasicBlock::iterator Last, bool PreferLast,		MachineBasicBlock::iterator Last, bool PreferLast,
bool SaveSCC);		bool SaveSCC, bool CheckPhys);
void toExact(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,		void toExact(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
unsigned SaveWQM, unsigned LiveMaskReg);		unsigned SaveWQM, unsigned LiveMaskReg);
void toWQM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,		void toWQM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
unsigned SavedWQM);		unsigned SavedWQM);
void toWWM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,		void toWWM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
unsigned SaveOrig);		unsigned SaveOrig);
void fromWWM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,		void fromWWM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
unsigned SavedOrig);		unsigned SavedOrig, char NonWWMState);
void processBlock(MachineBasicBlock &MBB, unsigned LiveMaskReg, bool isEntry);

void lowerLiveMaskQueries(unsigned LiveMaskReg);		bool canSplitBlockAt(MachineBasicBlock *BB, MachineInstr *MI);
		MachineBasicBlock *splitBlock(MachineBasicBlock *BB,
		MachineInstr *TermMI);
		void lowerBlock(MachineBasicBlock &MBB);

		unsigned findLiveMaskReg(MachineBasicBlock &MBB, BlockInfo &BI,
		MachineBasicBlock::iterator &Before);
		void processBlock(MachineBasicBlock &MBB, bool isEntry);

		bool lowerLiveMaskQueries(unsigned LiveMaskReg);
void lowerCopyInstrs();		void lowerCopyInstrs();
		bool lowerDemoteInstrs();

		void lowerLiveMaskQuery(MachineBasicBlock &MBB,
		MachineInstr &MI,
		unsigned LiveMaskReg,
		bool isWQM);
		bool lowerDemote(MachineBasicBlock &MBB, MachineInstr &MI,
		unsigned LiveMaskIn, unsigned LiveMaskOut,
		bool isWQM);

public:		public:
static char ID;		static char ID;

SIWholeQuadMode() :		SIWholeQuadMode() :
MachineFunctionPass(ID) { }		MachineFunctionPass(ID) { }

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

StringRef getPassName() const override { return "SI Whole Quad Mode"; }		StringRef getPassName() const override { return "SI Whole Quad Mode"; }

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<LiveIntervals>();		AU.addRequired<LiveIntervals>();
AU.addPreserved<SlotIndexes>();
AU.addPreserved<LiveIntervals>();
AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace

char SIWholeQuadMode::ID = 0;		char SIWholeQuadMode::ID = 0;

▲ Show 20 Lines • Show All 160 Lines • ▼ Show 20 Lines	for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
if (!(BBI.InNeeds & StateExact)) {		if (!(BBI.InNeeds & StateExact)) {
BBI.InNeeds |= StateExact;		BBI.InNeeds |= StateExact;
Worklist.push_back(&MBB);		Worklist.push_back(&MBB);
}		}
GlobalFlags |= StateExact;		GlobalFlags |= StateExact;
III.Disabled = StateWQM | StateWWM;		III.Disabled = StateWQM | StateWWM;
continue;		continue;
} else {		} else {
if (Opcode == AMDGPU::SI_PS_LIVE) {		if (Opcode == AMDGPU::SI_PS_LIVE || Opcode == AMDGPU::SI_WQM_HELPER) {
LiveMaskQueries.push_back(&MI);		LiveMaskQueries.push_back(&MI);
		} else if (Opcode == AMDGPU::SI_DEMOTE_I1) {
		DemoteInstrs.push_back(&MI);
} else if (WQMOutputs) {		} else if (WQMOutputs) {
// The function is in machine SSA form, which means that physical		// The function is in machine SSA form, which means that physical
// VGPRs correspond to shader inputs and outputs. Inputs are		// VGPRs correspond to shader inputs and outputs. Inputs are
// only used, outputs are only defined.		// only used, outputs are only defined.
for (const MachineOperand &MO : MI.defs()) {		for (const MachineOperand &MO : MI.defs()) {
if (!MO.isReg())		if (!MO.isReg())
continue;		continue;

▲ Show 20 Lines • Show All 125 Lines • ▼ Show 20 Lines	if (WI.MI)
propagateInstruction(*WI.MI, Worklist);		propagateInstruction(*WI.MI, Worklist);
else		else
propagateBlock(*WI.MBB, Worklist);		propagateBlock(*WI.MBB, Worklist);
}		}

return GlobalFlags;		return GlobalFlags;
}		}

		// Trace live mask manipulation through the block, creating new virtual registers.
		// Additionally insert PHI nodes when the block has multiple predecessors
		// which manipulated the mask.
		void SIWholeQuadMode::scanLiveLanes(MachineBasicBlock &MBB,
		std::vector<MachineBasicBlock *> &Worklist) {
		BlockInfo &BI = Blocks[&MBB];

		if (BI.LiveMaskIn && BI.LiveMaskOut)
		return; // Block has been fully traced already.

		if (!BI.LiveMaskIn) {
		// Find the incoming live mask, or insert PHI if there are multiple.
		unsigned LastPredReg = 0;
		unsigned Count = 0;
		bool Valid = true;

		// Find predecessor live masks while performing basic deduplication.
		for (MachineBasicBlock *Pred : MBB.predecessors()) {
		BlockInfo &PredBI = Blocks[Pred];
		if (!PredBI.LiveMaskOut) {
		Valid = false;
		break;
		}
		if (PredBI.LiveMaskOut != LastPredReg) {
		LastPredReg = PredBI.LiveMaskOut;
		Count++;
		}
		}

		if (Valid) {
		// All predecessors have live mask outputs.
		if (Count > 1) {
		BI.LiveMaskIn = MRI->createVirtualRegister(TRI->getBoolRC());
		MachineInstrBuilder PHI = BuildMI(MBB, MBB.begin(), DebugLoc(),
		TII->get(TargetOpcode::PHI),
		BI.LiveMaskIn);
		for (MachineBasicBlock *Pred : MBB.predecessors()) {
		BlockInfo &PredBI = Blocks[Pred];
		PHI.addReg(PredBI.LiveMaskOut);
		PHI.addMBB(Pred);
		}
		LIS->InsertMachineInstrInMaps(*PHI);
		} else {
		BI.LiveMaskIn = LastPredReg;
		}
		} else {
		// Not all predecessor blocks have live mask outputs,
		// so this block will need to be revisited.

		if (!BI.LiveMaskOut) {
		// Give this block a live mask output to ensure forward progress.
		BI.LiveMaskOut = MRI->createVirtualRegister(TRI->getBoolRC());
		}

		// Queue this block to be revisited and visit predecessors.
		Worklist.push_back(&MBB);
		for (MachineBasicBlock *Pred : MBB.predecessors()) {
		BlockInfo &PredBI = Blocks[Pred];
		if (!PredBI.LiveMaskOut)
		Worklist.push_back(Pred);
		}
		return;
		}
		}

		assert(BI.LiveMaskIn);

		// Now that the initial live mask register is known the block can
		// be traced and intermediate live mask registers assigned for instructions
		// which manipulate the mask.
		unsigned CurrentLive = BI.LiveMaskIn;
		auto II = MBB.getFirstNonPHI(), IE = MBB.end();
		while (II != IE) {
		MachineInstr &MI = *II;
		if (MI.getOpcode() == AMDGPU::SI_DEMOTE_I1) {
		unsigned NewLive = MRI->createVirtualRegister(TRI->getBoolRC());
		LiveMaskRegs[&MI] = NewLive;
		CurrentLive = NewLive;
		}
		II++;
		}

		// If an output register was assigned to guarantee forward progress
		// then it is possible the current live register will not become the output
		// live mask register. This will be resolved during block lowering.
		if (!BI.LiveMaskOut) {
		BI.LiveMaskOut = CurrentLive;
		}
		}

		// Scan blocks for live mask manipulation operations in reverse post order
		// to minimise rescans: a block will have to be rescanned if its
		// predecessors' live mask output is not defined.
		void SIWholeQuadMode::analyzeLiveLanes(MachineFunction &MF) {
		std::vector<MachineBasicBlock *> Worklist;

		ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
		for (auto BI = RPOT.begin(), BE = RPOT.end(); BI != BE; ++BI) {
		MachineBasicBlock &MBB = **BI;
		scanLiveLanes(MBB, Worklist);
		}

		while (!Worklist.empty()) {
		MachineBasicBlock *MBB = Worklist.back();
		Worklist.pop_back();
		scanLiveLanes(*MBB, Worklist);
		}
		}
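The predecessor handling in scanLiveLanes above can be sketched in isolation. This is a simplified standalone model, not the patch's code: `MergeResult` and `mergePredecessors` are hypothetical names, registers are plain unsigned values, and 0 stands for "no live mask output assigned yet", as in the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one walk over a block's predecessors.
struct MergeResult {
  bool Valid;     // every predecessor already has a live mask output
  unsigned Count; // number of register changes seen during the walk
  unsigned Last;  // last predecessor register seen
  // A PHI is only built when all inputs are known and they differ.
  bool needsPhi() const { return Valid && Count > 1; }
};

// Mirrors the loop in the diff: stop at the first predecessor without
// an output, otherwise count changes relative to the previous register.
MergeResult mergePredecessors(const std::vector<unsigned> &PredOuts) {
  MergeResult R{true, 0, 0};
  for (unsigned Out : PredOuts) {
    if (Out == 0) {
      R.Valid = false;
      break;
    }
    if (Out != R.Last) {
      R.Last = Out;
      ++R.Count;
    }
  }
  return R;
}
```

Because the comparison is only against the previously seen register, the deduplication is "basic" in the sense the code comment uses: a sequence such as {7, 9, 7} counts three changes although only two distinct registers are involved. The PHI decision is unaffected, since any count above one leads to a PHI.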

/// Whether \p MI really requires the exec state computed during analysis.		/// Whether \p MI really requires the exec state computed during analysis.
///		///
/// Scalar instructions must occasionally be marked WQM for correct propagation		/// Scalar instructions must occasionally be marked WQM for correct propagation
/// (e.g. thread masks leading up to branches), but when it comes to actual		/// (e.g. thread masks leading up to branches), but when it comes to actual
/// execution, they don't care about EXEC.		/// execution, they don't care about EXEC.
bool SIWholeQuadMode::requiresCorrectState(const MachineInstr &MI) const {		bool SIWholeQuadMode::requiresCorrectState(const MachineInstr &MI) const {
if (MI.isTerminator())		if (MI.isTerminator())
return true;		return true;
Show All 38 Lines	SIWholeQuadMode::saveSCC(MachineBasicBlock &MBB,
return Restore;		return Restore;
}		}

// Return an iterator in the (inclusive) range [First, Last] at which		// Return an iterator in the (inclusive) range [First, Last] at which
// instructions can be safely inserted, keeping in mind that some of the		// instructions can be safely inserted, keeping in mind that some of the
// instructions we want to add necessarily clobber SCC.		// instructions we want to add necessarily clobber SCC.
MachineBasicBlock::iterator SIWholeQuadMode::prepareInsertion(		MachineBasicBlock::iterator SIWholeQuadMode::prepareInsertion(
MachineBasicBlock &MBB, MachineBasicBlock::iterator First,		MachineBasicBlock &MBB, MachineBasicBlock::iterator First,
MachineBasicBlock::iterator Last, bool PreferLast, bool SaveSCC) {		MachineBasicBlock::iterator Last, bool PreferLast, bool SaveSCC,
		bool CheckPhys) {
if (!SaveSCC)		if (!SaveSCC)
return PreferLast ? Last : First;		return PreferLast ? Last : First;

LiveRange &LR = LIS->getRegUnit(*MCRegUnitIterator(AMDGPU::SCC, TRI));		LiveRange &LR = LIS->getRegUnit(*MCRegUnitIterator(AMDGPU::SCC, TRI));
auto MBBE = MBB.end();		auto MBBE = MBB.end();
SlotIndex FirstIdx = First != MBBE ? LIS->getInstructionIndex(*First)		SlotIndex FirstIdx = First != MBBE ? LIS->getInstructionIndex(*First)
: LIS->getMBBEndIdx(&MBB);		: LIS->getMBBEndIdx(&MBB);
SlotIndex LastIdx =		SlotIndex LastIdx =
Show All 16 Lines	if (PreferLast) {
if (Next > LastIdx)		if (Next > LastIdx)
break;		break;
Idx = Next;		Idx = Next;
}		}
}		}

MachineBasicBlock::iterator MBBI;		MachineBasicBlock::iterator MBBI;

if (MachineInstr *MI = LIS->getInstructionFromIndex(Idx))		if (MachineInstr *MI = LIS->getInstructionFromIndex(Idx)) {
MBBI = MI;		MBBI = MI;
else {		if (CheckPhys) {
		// Make sure insertion point is after any COPY instructions
		// accessing physical live in registers. This ensures that
		// block splitting does not occur before all live ins have been copied.
		while (MBBI != Last) {
		if (MBBI->getOpcode() != AMDGPU::COPY)
		break;
		unsigned Src = MBBI->getOperand(1).getReg();
		if (!Register::isVirtualRegister(Src) && MBB.isLiveIn(Src)) {
		MBBI++;
		arsenm: use Register
		critson (author): I assume you mean the variable name Src -> Register?
		foad: I'm pretty sure he meant s/unsigned Src/Register Src/. clang-tidy thinks the same.
		} else {
		break;
		}
		}
		}
		} else {
assert(Idx == LIS->getMBBEndIdx(&MBB));		assert(Idx == LIS->getMBBEndIdx(&MBB));
MBBI = MBB.end();		MBBI = MBB.end();
}		}

if (SaveSCC)		if (SaveSCC)
MBBI = saveSCC(MBB, MBBI);		MBBI = saveSCC(MBB, MBBI);

return MBBI;		return MBBI;
Show All 14 Lines	if (SaveWQM) {
MI = BuildMI(MBB, Before, DebugLoc(), TII->get(ST->isWave32() ?		MI = BuildMI(MBB, Before, DebugLoc(), TII->get(ST->isWave32() ?
AMDGPU::S_AND_B32 : AMDGPU::S_AND_B64),		AMDGPU::S_AND_B32 : AMDGPU::S_AND_B64),
Exec)		Exec)
.addReg(Exec)		.addReg(Exec)
.addReg(LiveMaskReg);		.addReg(LiveMaskReg);
}		}

LIS->InsertMachineInstrInMaps(*MI);		LIS->InsertMachineInstrInMaps(*MI);
		StateTransition[MI] = StateExact;
}		}

void SIWholeQuadMode::toWQM(MachineBasicBlock &MBB,		void SIWholeQuadMode::toWQM(MachineBasicBlock &MBB,
MachineBasicBlock::iterator Before,		MachineBasicBlock::iterator Before,
unsigned SavedWQM) {		unsigned SavedWQM) {
MachineInstr *MI;		MachineInstr *MI;

unsigned Exec = ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;		unsigned Exec = ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
if (SavedWQM) {		if (SavedWQM) {
MI = BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::COPY), Exec)		MI = BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::COPY), Exec)
.addReg(SavedWQM);		.addReg(SavedWQM);
} else {		} else {
MI = BuildMI(MBB, Before, DebugLoc(), TII->get(ST->isWave32() ?		MI = BuildMI(MBB, Before, DebugLoc(), TII->get(ST->isWave32() ?
AMDGPU::S_WQM_B32 : AMDGPU::S_WQM_B64),		AMDGPU::S_WQM_B32 : AMDGPU::S_WQM_B64),
Exec)		Exec)
.addReg(Exec);		.addReg(Exec);
}		}

LIS->InsertMachineInstrInMaps(*MI);		LIS->InsertMachineInstrInMaps(*MI);
		StateTransition[MI] = StateWQM;
}		}
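
For readers unfamiliar with the mask that the S_WQM_B32/S_WQM_B64 instructions in toWQM produce, the following is a minimal sketch of the whole-quad-mode expansion on a plain 64-bit integer. The function name wholeQuadMask is hypothetical and not part of the patch; the sketch assumes a 64-lane wave and only models the bit arithmetic, not the machine instruction itself.

```cpp
#include <cassert>
#include <cstdint>

// Model of a whole-quad-mode mask on a 64-lane wave: if any lane of a
// 4-lane quad is active, all four lanes of that quad become active.
// (Illustrative only; wholeQuadMask is a hypothetical helper.)
std::uint64_t wholeQuadMask(std::uint64_t Exec) {
  std::uint64_t Result = 0;
  for (int Quad = 0; Quad < 16; ++Quad) {
    const std::uint64_t QuadBits = std::uint64_t{0xF} << (4 * Quad);
    if (Exec & QuadBits)
      Result |= QuadBits; // enable helper lanes for this quad
  }
  return Result;
}
```

With a single live lane in a quad, the other three lanes of that quad are enabled as helpers, which is what makes derivative computations well defined.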

void SIWholeQuadMode::toWWM(MachineBasicBlock &MBB,		void SIWholeQuadMode::toWWM(MachineBasicBlock &MBB,
MachineBasicBlock::iterator Before,		MachineBasicBlock::iterator Before,
unsigned SaveOrig) {		unsigned SaveOrig) {
MachineInstr *MI;		MachineInstr *MI;

assert(SaveOrig);		assert(SaveOrig);
MI = BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::ENTER_WWM), SaveOrig)		MI = BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::ENTER_WWM), SaveOrig)
.addImm(-1);		.addImm(-1);
LIS->InsertMachineInstrInMaps(*MI);		LIS->InsertMachineInstrInMaps(*MI);
		StateTransition[MI] = StateWWM;
}		}

void SIWholeQuadMode::fromWWM(MachineBasicBlock &MBB,		void SIWholeQuadMode::fromWWM(MachineBasicBlock &MBB,
MachineBasicBlock::iterator Before,		MachineBasicBlock::iterator Before,
unsigned SavedOrig) {		unsigned SavedOrig,
		char NonWWMState) {
MachineInstr *MI;		MachineInstr *MI;

assert(SavedOrig);		assert(SavedOrig);
MI = BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::EXIT_WWM),		MI = BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::EXIT_WWM),
ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC)		ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC)
.addReg(SavedOrig);		.addReg(SavedOrig);
LIS->InsertMachineInstrInMaps(*MI);		LIS->InsertMachineInstrInMaps(*MI);
		StateTransition[MI] = NonWWMState;
}		}

void SIWholeQuadMode::processBlock(MachineBasicBlock &MBB, unsigned LiveMaskReg,		void SIWholeQuadMode::lowerLiveMaskQuery(MachineBasicBlock &MBB,
bool isEntry) {		MachineInstr &MI,
		unsigned LiveMaskReg,
		bool isWQM) {
		const DebugLoc &DL = MI.getDebugLoc();
		unsigned Dest = MI.getOperand(0).getReg();
		MachineInstr *Copy =
		BuildMI(MBB, MI, DL, TII->get(AMDGPU::COPY), Dest)
		.addReg(LiveMaskReg);
		LIS->ReplaceMachineInstrInMaps(MI, *Copy);
		MBB.remove(&MI);
		}

		// Lower an instruction which demotes lanes to helpers by adding
		// appropriate live mask manipulation. Note this is also applied to kills.
		bool SIWholeQuadMode::lowerDemote(MachineBasicBlock &MBB, MachineInstr &MI,
		unsigned LiveMaskIn, unsigned LiveMaskOut,
		bool isWQM) {
		const unsigned Exec = ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
		const unsigned AndN2 =
		ST->isWave32() ? AMDGPU::S_ANDN2_B32 : AMDGPU::S_ANDN2_B64;
		const unsigned And =
		ST->isWave32() ? AMDGPU::S_AND_B32 : AMDGPU::S_AND_B64;

		const DebugLoc &DL = MI.getDebugLoc();
		MachineInstr *NewMI = nullptr;
		bool NeedSplit = false;

		const MachineOperand &Op = MI.getOperand(0);
		int64_t KillVal = MI.getOperand(1).getImm();
		if (Op.isImm()) {
		int64_t Imm = Op.getImm();
		if (Imm == KillVal) {
		NewMI = BuildMI(MBB, MI, DL,
		TII->get(AndN2),
		LiveMaskOut)
		.addReg(LiveMaskIn)
		.addReg(Exec);
		}
		} else {
		unsigned Opcode = KillVal ? AndN2 : And;
		NewMI = BuildMI(MBB, MI, DL,
		TII->get(Opcode),
		LiveMaskOut)
		.addReg(LiveMaskIn)
		.add(Op);
		}

		if (MI.getOpcode() == AMDGPU::SI_DEMOTE_I1) {
		if (isWQM) {
		// Inside WQM demotes are replaced with live mask manipulation
		LIS->RemoveMachineInstrFromMaps(MI);
		MBB.remove(&MI);
		} else {
		// Outside WQM demotes become kills terminating the block
		NeedSplit = true;
		}
		}

		if (NewMI) {
		LIS->InsertMachineInstrInMaps(*NewMI);
		}

		return NeedSplit;
		}
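
The live mask update in lowerDemote can be sketched on plain integers. The helper names below are hypothetical and only mirror the opcode selection visible above (S_ANDN2 for the static case and for a non-zero KillVal, S_AND otherwise); this is a simplified model assuming one bit per lane, not the pass's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Static demote (immediate operand equal to KillVal): every lane that is
// currently active in Exec is cleared from the live mask (S_ANDN2).
std::uint64_t demoteActiveLanes(std::uint64_t LiveMask, std::uint64_t Exec) {
  return LiveMask & ~Exec;
}

// Dynamic demote: Cond holds one bit per lane. A non-zero KillVal clears
// the lanes where Cond is set (S_ANDN2); otherwise only lanes where Cond
// is set stay live (S_AND).
std::uint64_t demoteByCondition(std::uint64_t LiveMask, std::uint64_t Cond,
                                bool KillVal) {
  return KillVal ? (LiveMask & ~Cond) : (LiveMask & Cond);
}
```

Note that in this model EXEC itself is untouched; inside WQM the demoted lanes keep running as helpers, and only the live mask records that they are no longer live.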

		bool SIWholeQuadMode::canSplitBlockAt(MachineBasicBlock *BB,
		MachineInstr *MI) {
		// Cannot split immediately before the epilog
		// because there are values in physical registers
		if (MI->getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG) {
		return false;
		}

		return true;
		}

		MachineBasicBlock *SIWholeQuadMode::splitBlock(MachineBasicBlock *BB,
		MachineInstr *TermMI) {
		MachineBasicBlock::iterator SplitPoint(TermMI);
		SplitPoint++;

		LLVM_DEBUG(dbgs() << "Split block " << printMBBReference(*BB)
		<< " @ " << *TermMI << "\n");

		MachineBasicBlock *SplitBB = nullptr;

		// Only split the block if the split point is not
		// already the end of the block.
		if (SplitPoint != BB->getFirstTerminator()) {
		MachineFunction *MF = BB->getParent();
		SplitBB = MF->CreateMachineBasicBlock(BB->getBasicBlock());

		MachineFunction::iterator MBBI(BB);
		++MBBI;
		MF->insert(MBBI, SplitBB);

		SplitBB->splice(SplitBB->begin(), BB, SplitPoint, BB->end());
		SplitBB->transferSuccessorsAndUpdatePHIs(BB);
		BB->addSuccessor(SplitBB);
		}

		// Convert last instruction into a terminator.
		// Note: this only covers the expected patterns
		switch (TermMI->getOpcode()) {
		case AMDGPU::S_AND_B32:
		TermMI->setDesc(TII->get(AMDGPU::S_AND_B32_term));
		break;
		case AMDGPU::S_AND_B64:
		TermMI->setDesc(TII->get(AMDGPU::S_AND_B64_term));
		break;
		case AMDGPU::SI_DEMOTE_I1:
		TermMI->setDesc(TII->get(AMDGPU::SI_DEMOTE_I1_TERMINATOR));
		break;
		case AMDGPU::SI_DEMOTE_I1_TERMINATOR:
		break;
		default:
		if (BB->getFirstTerminator() == BB->end()) {
		assert(SplitBB != nullptr);
		BuildMI(*BB, BB->end(), DebugLoc(), TII->get(AMDGPU::S_BRANCH))
		.addMBB(SplitBB);
		}
		break;
		}

		return SplitBB;
		}
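
The splice-and-rewire step of splitBlock can be illustrated with standard containers. This is a toy model under stated assumptions: ToyBlock and splitAfter are hypothetical names, instructions are plain strings, and the split is skipped only when nothing follows the split point (the patch instead compares against the first terminator and also rewrites the terminator opcode, which is not modelled here).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a machine basic block (hypothetical type).
struct ToyBlock {
  std::list<std::string> Instrs;
  std::vector<ToyBlock *> Succs;
};

// Move everything after Instrs[Index] into Tail, hand BB's successors to
// Tail, and make Tail the only successor of BB. Returns false (and does
// nothing) if there is no instruction after the split point.
bool splitAfter(ToyBlock &BB, std::size_t Index, ToyBlock &Tail) {
  assert(Index < BB.Instrs.size());
  auto SplitPoint = std::next(BB.Instrs.begin(), Index + 1);
  if (SplitPoint == BB.Instrs.end())
    return false;
  Tail.Instrs.splice(Tail.Instrs.begin(), BB.Instrs, SplitPoint,
                     BB.Instrs.end());
  Tail.Succs = std::move(BB.Succs);
  BB.Succs.clear();
  BB.Succs.push_back(&Tail);
  return true;
}
```

Using a list splice keeps iterators to the moved instructions valid, which is the same property the real splice on a MachineBasicBlock relies on.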

		// Replace (or supplement) instructions accessing live mask.
		// This can only happen once all the live mask registers have been created
		// and the execute state (WQM/WWM/Exact) of instructions is known.
		void SIWholeQuadMode::lowerBlock(MachineBasicBlock &MBB) {
auto BII = Blocks.find(&MBB);		auto BII = Blocks.find(&MBB);
if (BII == Blocks.end())		if (BII == Blocks.end())
return;		return;

		LLVM_DEBUG(dbgs() << "\nLowering block " << printMBBReference(MBB)
		<< ":\n");

const BlockInfo &BI = BII->second;		const BlockInfo &BI = BII->second;

		SmallVector<MachineInstr *, 4> SplitPoints;
		unsigned LiveMaskReg = BI.LiveMaskIn;
		char State = BI.InitialState;

		auto II = MBB.getFirstNonPHI(), IE = MBB.end();
		while (II != IE) {
		auto Next = std::next(II);
		MachineInstr &MI = *II;

		if (StateTransition.count(&MI)) {
		// Mark transitions to Exact mode as split points so they become
		// block terminators.
		if (State != StateTransition[&MI] && StateTransition[&MI] == StateExact) {
		if (State != StateWWM && canSplitBlockAt(&MBB, &MI))
		SplitPoints.push_back(&MI);
		}
		State = StateTransition[&MI];
		}

		switch (MI.getOpcode()) {
		case AMDGPU::SI_PS_LIVE:
		case AMDGPU::SI_WQM_HELPER:
		lowerLiveMaskQuery(MBB, MI, LiveMaskReg, State == StateWQM);
		break;
		case AMDGPU::SI_DEMOTE_I1: {
		bool NeedSplit = lowerDemote(MBB, MI, LiveMaskReg,
		LiveMaskRegs[&MI],
		State == StateWQM);
		if (NeedSplit)
		SplitPoints.push_back(&MI);
		break;
		}
		default:
		break;
		}

		if (LiveMaskRegs.count(&MI))
		LiveMaskReg = LiveMaskRegs[&MI];

		II = Next;
		}

		if (BI.LiveMaskOut != LiveMaskReg) {
		// If the final live mask register does not match the expected
		// register of successor blocks then insert a copy.
		MachineBasicBlock::instr_iterator Terminator =
		MBB.getFirstInstrTerminator();
		MachineInstr *MI = BuildMI(MBB, Terminator, DebugLoc(),
		TII->get(AMDGPU::COPY), BI.LiveMaskOut)
		.addReg(LiveMaskReg);
		LIS->InsertMachineInstrInMaps(*MI);
		}

		// Perform splitting after instruction scan to simplify iteration.
		if (!SplitPoints.empty()) {
		MachineBasicBlock *BB = &MBB;
		for (MachineInstr *MI : SplitPoints) {
		BB = splitBlock(BB, MI);
		}
		}
		}

		unsigned SIWholeQuadMode::findLiveMaskReg(MachineBasicBlock &MBB, BlockInfo &BI,
		MachineBasicBlock::iterator &Before) {
		assert(BI.LiveMaskIn);
		if (BI.LiveMaskIn == BI.LiveMaskOut)
		return BI.LiveMaskIn;

		// FIXME: make this more efficient than scanning all instructions in a block
		unsigned LiveMaskReg = BI.LiveMaskIn;
		auto II = MBB.getFirstNonPHI(), IE = MBB.end();

		while ((II != IE) && (II != Before)) {
		MachineInstr *I = &*II;
		if (LiveMaskRegs.count(I))
		LiveMaskReg = LiveMaskRegs[I];
		II++;
		}

		assert(LiveMaskReg);
		return LiveMaskReg;
		}
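
The linear scan in findLiveMaskReg amounts to "take the last live mask definition seen before the insertion point". A minimal sketch with standard containers follows; liveMaskBefore and its parameters are hypothetical, instructions are modelled as integer IDs, and the map plays the role of LiveMaskRegs.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Return the live mask register in effect just before position Before.
// Instrs lists instruction IDs in block order; LiveMaskDefs maps an
// instruction ID to the live mask register it defines (if any).
unsigned liveMaskBefore(const std::vector<int> &Instrs, std::size_t Before,
                        const std::unordered_map<int, unsigned> &LiveMaskDefs,
                        unsigned LiveMaskIn) {
  unsigned Current = LiveMaskIn;
  for (std::size_t I = 0; I < Instrs.size() && I < Before; ++I) {
    auto It = LiveMaskDefs.find(Instrs[I]);
    if (It != LiveMaskDefs.end())
      Current = It->second; // a later definition supersedes earlier ones
  }
  return Current;
}
```

As the FIXME above notes, this is linear in the block size per query; caching the result per instruction would be one possible way to avoid repeated scans, though that is not something the patch does.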

		void SIWholeQuadMode::processBlock(MachineBasicBlock &MBB, bool isEntry) {
		auto BII = Blocks.find(&MBB);
		if (BII == Blocks.end())
		return;

		BlockInfo &BI = BII->second;

// This is a non-entry block that is WQM throughout, so no need to do		// This is a non-entry block that is WQM throughout, so no need to do
// anything.		// anything.
if (!isEntry && BI.Needs == StateWQM && BI.OutNeeds != StateExact)		if (!isEntry && BI.Needs == StateWQM && BI.OutNeeds != StateExact) {
		BI.InitialState = StateWQM;
return;		return;
		}

LLVM_DEBUG(dbgs() << "\nProcessing block " << printMBBReference(MBB)		LLVM_DEBUG(dbgs() << "\nProcessing block " << printMBBReference(MBB)
<< ":\n");		<< ":\n");

unsigned SavedWQMReg = 0;		unsigned SavedWQMReg = 0;
unsigned SavedNonWWMReg = 0;		unsigned SavedNonWWMReg = 0;
bool WQMFromExec = isEntry;		bool WQMFromExec = isEntry;
char State = (isEntry || !(BI.InNeeds & StateWQM)) ? StateExact : StateWQM;		char State = (isEntry || !(BI.InNeeds & StateWQM)) ? StateExact : StateWQM;
char NonWWMState = 0;		char NonWWMState = 0;
const TargetRegisterClass *BoolRC = TRI->getBoolRC();		const TargetRegisterClass *BoolRC = TRI->getBoolRC();

auto II = MBB.getFirstNonPHI(), IE = MBB.end();		auto II = MBB.getFirstNonPHI(), IE = MBB.end();
if (isEntry)		if (isEntry)
++II; // Skip the instruction that saves LiveMask		++II; // Skip the instruction that saves LiveMask

// This stores the first instruction where it's safe to switch from WQM to		// This stores the first instruction where it's safe to switch from WQM to
// Exact or vice versa.		// Exact or vice versa.
MachineBasicBlock::iterator FirstWQM = IE;		MachineBasicBlock::iterator FirstWQM = IE;

// This stores the first instruction where it's safe to switch from WWM to		// This stores the first instruction where it's safe to switch from WWM to
// Exact/WQM or to switch to WWM. It must always be the same as, or after,		// Exact/WQM or to switch to WWM. It must always be the same as, or after,
// FirstWQM since if it's safe to switch to/from WWM, it must be safe to		// FirstWQM since if it's safe to switch to/from WWM, it must be safe to
// switch to/from WQM as well.		// switch to/from WQM as well.
MachineBasicBlock::iterator FirstWWM = IE;		MachineBasicBlock::iterator FirstWWM = IE;

		// Record initial state in block information.
		BI.InitialState = State;

for (;;) {		for (;;) {
MachineBasicBlock::iterator Next = II;		MachineBasicBlock::iterator Next = II;
char Needs = StateExact | StateWQM; // WWM is disabled by default		char Needs = StateExact | StateWQM; // WWM is disabled by default
char OutNeeds = 0;		char OutNeeds = 0;

if (FirstWQM == IE)		if (FirstWQM == IE)
FirstWQM = II;		FirstWQM = II;

▲ Show 20 Lines • Show All 47 Lines • ▼ Show 20 Lines	if (!(Needs & State)) {
First = FirstWWM;		First = FirstWWM;
} else {		} else {
// We only need to switch to/from WQM, so we can use FirstWQM		// We only need to switch to/from WQM, so we can use FirstWQM
First = FirstWQM;		First = FirstWQM;
}		}

MachineBasicBlock::iterator Before =		MachineBasicBlock::iterator Before =
prepareInsertion(MBB, First, II, Needs == StateWQM,		prepareInsertion(MBB, First, II, Needs == StateWQM,
Needs == StateExact || WQMFromExec);		Needs == StateExact || WQMFromExec,
		Needs == StateExact && isEntry);

if (State == StateWWM) {		if (State == StateWWM) {
assert(SavedNonWWMReg);		assert(SavedNonWWMReg);
fromWWM(MBB, Before, SavedNonWWMReg);		fromWWM(MBB, Before, SavedNonWWMReg, NonWWMState);
State = NonWWMState;		State = NonWWMState;
}		}

if (Needs == StateWWM) {		if (Needs == StateWWM) {
NonWWMState = State;		NonWWMState = State;
SavedNonWWMReg = MRI->createVirtualRegister(BoolRC);		SavedNonWWMReg = MRI->createVirtualRegister(BoolRC);
toWWM(MBB, Before, SavedNonWWMReg);		toWWM(MBB, Before, SavedNonWWMReg);
State = StateWWM;		State = StateWWM;
} else {		} else {
if (State == StateWQM && (Needs & StateExact) && !(Needs & StateWQM)) {		if (State == StateWQM && (Needs & StateExact) && !(Needs & StateWQM)) {
if (!WQMFromExec && (OutNeeds & StateWQM))		if (!WQMFromExec && (OutNeeds & StateWQM))
SavedWQMReg = MRI->createVirtualRegister(BoolRC);		SavedWQMReg = MRI->createVirtualRegister(BoolRC);

toExact(MBB, Before, SavedWQMReg, LiveMaskReg);		toExact(MBB, Before, SavedWQMReg, findLiveMaskReg(MBB, BI, Before));
State = StateExact;		State = StateExact;
} else if (State == StateExact && (Needs & StateWQM) &&		} else if (State == StateExact && (Needs & StateWQM) &&
!(Needs & StateExact)) {		!(Needs & StateExact)) {
assert(WQMFromExec == (SavedWQMReg == 0));		assert(WQMFromExec == (SavedWQMReg == 0));

toWQM(MBB, Before, SavedWQMReg);		toWQM(MBB, Before, SavedWQMReg);

if (SavedWQMReg) {		if (SavedWQMReg) {
Show All 12 Lines	for (;;) {
if (Needs != (StateExact | StateWQM | StateWWM)) {		if (Needs != (StateExact | StateWQM | StateWWM)) {
if (Needs != (StateExact | StateWQM))		if (Needs != (StateExact | StateWQM))
FirstWQM = IE;		FirstWQM = IE;
FirstWWM = IE;		FirstWWM = IE;
}		}

if (II == IE)		if (II == IE)
break;		break;

II = Next;		II = Next;
}		}
}		}

void SIWholeQuadMode::lowerLiveMaskQueries(unsigned LiveMaskReg) {		bool SIWholeQuadMode::lowerLiveMaskQueries(unsigned LiveMaskReg) {
		bool Changed = false;
for (MachineInstr *MI : LiveMaskQueries) {		for (MachineInstr *MI : LiveMaskQueries) {
const DebugLoc &DL = MI->getDebugLoc();		const DebugLoc &DL = MI->getDebugLoc();
Register Dest = MI->getOperand(0).getReg();		Register Dest = MI->getOperand(0).getReg();
MachineInstr *Copy =		MachineInstr *Copy =
BuildMI(*MI->getParent(), MI, DL, TII->get(AMDGPU::COPY), Dest)		BuildMI(*MI->getParent(), MI, DL, TII->get(AMDGPU::COPY), Dest)
.addReg(LiveMaskReg);		.addReg(LiveMaskReg);

LIS->ReplaceMachineInstrInMaps(*MI, *Copy);		LIS->ReplaceMachineInstrInMaps(*MI, *Copy);
MI->eraseFromParent();		MI->eraseFromParent();
		Changed = true;
		}
		return Changed;
		}

		bool SIWholeQuadMode::lowerDemoteInstrs() {
		bool Changed = false;
		for (MachineInstr *MI : DemoteInstrs) {
		MachineBasicBlock *MBB = MI->getParent();
		splitBlock(MBB, MI);
		Changed = true;
}		}
		return Changed;
}		}

void SIWholeQuadMode::lowerCopyInstrs() {		void SIWholeQuadMode::lowerCopyInstrs() {
for (MachineInstr *MI : LowerToCopyInstrs) {		for (MachineInstr *MI : LowerToCopyInstrs) {
for (unsigned i = MI->getNumExplicitOperands() - 1; i > 1; i--)		for (unsigned i = MI->getNumExplicitOperands() - 1; i > 1; i--)
MI->RemoveOperand(i);		MI->RemoveOperand(i);

const Register Reg = MI->getOperand(0).getReg();		const Register Reg = MI->getOperand(0).getReg();
Show All 14 Lines	void SIWholeQuadMode::lowerCopyInstrs() {
}		}
}		}

bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {		bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {
Instructions.clear();		Instructions.clear();
Blocks.clear();		Blocks.clear();
LiveMaskQueries.clear();		LiveMaskQueries.clear();
LowerToCopyInstrs.clear();		LowerToCopyInstrs.clear();
		DemoteInstrs.clear();
		LiveMaskRegs.clear();
		StateTransition.clear();

CallingConv = MF.getFunction().getCallingConv();		CallingConv = MF.getFunction().getCallingConv();

ST = &MF.getSubtarget<GCNSubtarget>();		ST = &MF.getSubtarget<GCNSubtarget>();

TII = ST->getInstrInfo();		TII = ST->getInstrInfo();
TRI = &TII->getRegisterInfo();		TRI = &TII->getRegisterInfo();
MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
LIS = &getAnalysis<LiveIntervals>();		LIS = &getAnalysis<LiveIntervals>();

char GlobalFlags = analyzeFunction(MF);		const char GlobalFlags = analyzeFunction(MF);
unsigned LiveMaskReg = 0;		const bool NeedsLiveMask =
unsigned Exec = ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;		!(DemoteInstrs.empty() && LiveMaskQueries.empty());
if (!(GlobalFlags & StateWQM)) {		const unsigned Exec = ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
lowerLiveMaskQueries(Exec);		unsigned LiveMaskReg = Exec;
if (!(GlobalFlags & StateWWM) && LowerToCopyInstrs.empty())
return !LiveMaskQueries.empty();		if (!(GlobalFlags & (StateWQM | StateWWM)) && LowerToCopyInstrs.empty()) {
} else {		// Shader only needs Exact mode
// Store a copy of the original live mask when required		const bool LoweredQueries = lowerLiveMaskQueries(LiveMaskReg);
		const bool LoweredDemotes = lowerDemoteInstrs();
		return LoweredQueries || LoweredDemotes;
		}

MachineBasicBlock &Entry = MF.front();		MachineBasicBlock &Entry = MF.front();
MachineBasicBlock::iterator EntryMI = Entry.getFirstNonPHI();		MachineBasicBlock::iterator EntryMI = Entry.getFirstNonPHI();

if (GlobalFlags & StateExact || !LiveMaskQueries.empty()) {		// Store a copy of the original live mask when required
		if (NeedsLiveMask || (GlobalFlags & StateWQM)) {
LiveMaskReg = MRI->createVirtualRegister(TRI->getBoolRC());		LiveMaskReg = MRI->createVirtualRegister(TRI->getBoolRC());
MachineInstr *MI = BuildMI(Entry, EntryMI, DebugLoc(),		MachineInstr *MI = BuildMI(Entry, EntryMI, DebugLoc(),
TII->get(AMDGPU::COPY), LiveMaskReg)		TII->get(AMDGPU::COPY), LiveMaskReg)
.addReg(Exec);		.addReg(Exec);
LIS->InsertMachineInstrInMaps(*MI);		LIS->InsertMachineInstrInMaps(*MI);
}		}

lowerLiveMaskQueries(LiveMaskReg);		if ((GlobalFlags == StateWQM) && DemoteInstrs.empty()) {
		// Shader only needs WQM
if (GlobalFlags == StateWQM) {
// For a shader that needs only WQM, we can just set it once.
BuildMI(Entry, EntryMI, DebugLoc(), TII->get(ST->isWave32() ?		BuildMI(Entry, EntryMI, DebugLoc(), TII->get(ST->isWave32() ?
AMDGPU::S_WQM_B32 : AMDGPU::S_WQM_B64),		AMDGPU::S_WQM_B32 : AMDGPU::S_WQM_B64),
Exec)		Exec)
.addReg(Exec);		.addReg(Exec);

		lowerLiveMaskQueries(LiveMaskReg);
lowerCopyInstrs();		lowerCopyInstrs();
// EntryMI may become invalid here
return true;		return true;
}		}

		if (NeedsLiveMask && (GlobalFlags & StateWQM)) {
		BlockInfo &BI = Blocks[&Entry];
		BI.LiveMaskIn = LiveMaskReg;
		analyzeLiveLanes(MF);
		} else {
		for (auto BII : Blocks) {
		BlockInfo &BI = Blocks[&*BII.first];
		BI.LiveMaskIn = LiveMaskReg;
		BI.LiveMaskOut = LiveMaskReg;
		}
}		}

LLVM_DEBUG(printInfo());		LLVM_DEBUG(printInfo());

lowerCopyInstrs();		lowerCopyInstrs();

// Handle the general case		for (auto BII : Blocks) {
for (auto BII : Blocks)		processBlock(*BII.first, BII.first == &Entry);
processBlock(*BII.first, LiveMaskReg, BII.first == &*MF.begin());		}

// Physical registers like SCC aren't tracked by default anyway, so just		if (NeedsLiveMask && (GlobalFlags & StateWQM)) {
// removing the ranges we computed is the simplest option for maintaining		// Lowering blocks causes block splitting.
// the analysis results.		// Hence live ranges and slot indexes cease to be valid here.
LIS->removeRegUnit(*MCRegUnitIterator(AMDGPU::SCC, TRI));		for (auto BII : Blocks) {
		lowerBlock(*BII.first);
		}
		} else {
		lowerLiveMaskQueries(LiveMaskReg);
		lowerDemoteInstrs();
		}

return true;		return true;
}		}

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,SI,GCN-64 %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX9,GCN-64 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32,-wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX10,GCN-32 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1010 -mattr=-wavefrontsize32,+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX10,GCN-64 %s

				; GCN-LABEL: {{^}}static_exact:
				; GCN-32: v_cmp_gt_f32_e32 [[CMP:vcc_lo]], 0, v0
				; GCN-64: v_cmp_gt_f32_e32 [[CMP:vcc]], 0, v0
				; GCN-32: s_mov_b32 exec_lo, 0
				; GCN-64: s_mov_b64 exec, 0
				; GCN: v_cndmask_b32_e64 v{{[0-9]+}}, 0, 1.0, [[CMP]]
				; GCN: exp mrt1 v0, v0, v0, v0 done vm
				define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
				.entry:
				%c0 = fcmp olt float %arg0, 0.000000e+00
				%c1 = fcmp oge float %arg1, 0.0
				call void @llvm.amdgcn.wqm.demote(i1 false)
				%tmp1 = select i1 %c0, float 1.000000e+00, float 0.000000e+00
				call void @llvm.amdgcn.exp.f32(i32 1, i32 15, float %tmp1, float %tmp1, float %tmp1, float %tmp1, i1 true, i1 true) #0
				ret void
				}

				; GCN-LABEL: {{^}}dynamic_exact:
				; GCN-32: v_cmp_le_f32_e64 [[CND:s[0-9]+]], 0, v1
				; GCN-64: v_cmp_le_f32_e64 [[CND:s\[[0-9]+:[0-9]+\]]], 0, v1
				; GCN-32: v_cmp_gt_f32_e32 [[CMP:vcc_lo]], 0, v0
				; GCN-64: v_cmp_gt_f32_e32 [[CMP:vcc]], 0, v0
				; GCN-32: s_and_b32 exec_lo, exec_lo, [[CND]]
				; GCN-64: s_and_b64 exec, exec, [[CND]]
				; GCN: v_cndmask_b32_e64 v{{[0-9]+}}, 0, 1.0, [[CMP]]
				; GCN: exp mrt1 v0, v0, v0, v0 done vm
				define amdgpu_ps void @dynamic_exact(float %arg0, float %arg1) {
				.entry:
				%c0 = fcmp olt float %arg0, 0.000000e+00
				%c1 = fcmp oge float %arg1, 0.0
				call void @llvm.amdgcn.wqm.demote(i1 %c1)
				%tmp1 = select i1 %c0, float 1.000000e+00, float 0.000000e+00
				call void @llvm.amdgcn.exp.f32(i32 1, i32 15, float %tmp1, float %tmp1, float %tmp1, float %tmp1, i1 true, i1 true) #0
				ret void
				}

				; GCN-LABEL: {{^}}branch:
				; GCN-32: s_and_saveexec_b32 s1, s0
				; GCN-64: s_and_saveexec_b64 s[2:3], s[0:1]
				; GCN-32: s_xor_b32 s0, exec_lo, s1
				; GCN-64: s_xor_b64 s[0:1], exec, s[2:3]
				; GCN-32: s_mov_b32 exec_lo, 0
				; GCN-64: s_mov_b64 exec, 0
				; GCN-32: s_or_b32 exec_lo, exec_lo, s0
				; GCN-64: s_or_b64 exec, exec, s[0:1]
				; GCN: v_cndmask_b32_e64 v0, 0, 1.0, vcc
				; GCN: exp mrt1 v0, v0, v0, v0 done vm
				define amdgpu_ps void @branch(float %arg0, float %arg1) {
				.entry:
				%i0 = fptosi float %arg0 to i32
				%i1 = fptosi float %arg1 to i32
				%c0 = or i32 %i0, %i1
				%c1 = and i32 %c0, 1
				%c2 = icmp eq i32 %c1, 0
				br i1 %c2, label %.continue, label %.demote

				.demote:
				call void @llvm.amdgcn.wqm.demote(i1 false)
				br label %.continue

				.continue:
				%tmp1 = select i1 %c2, float 1.000000e+00, float 0.000000e+00
				call void @llvm.amdgcn.exp.f32(i32 1, i32 15, float %tmp1, float %tmp1, float %tmp1, float %tmp1, i1 true, i1 true) #0
				ret void
				}


				; GCN-LABEL: {{^}}wqm_demote_1:
				; GCN-NEXT: ; %.entry
				; GCN-32: s_mov_b32 [[ORIG:s[0-9]+]], exec_lo
				; GCN-64: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
				; GCN-32: s_wqm_b32 exec_lo, exec_lo
				; GCN-64: s_wqm_b64 exec, exec
				; GCN: ; %.demote
				; GCN-32-NEXT: s_andn2_b32 [[LIVE:s[0-9]+]], [[ORIG]], exec
				; GCN-64-NEXT: s_andn2_b64 [[LIVE:s\[[0-9]+:[0-9]+\]]], [[ORIG]], exec
				; GCN: ; %.continue
				; GCN: image_sample
				; GCN: v_add_f32_e32
				; GCN-32: s_and_b32 exec_lo, exec_lo, [[LIVE]]
				; GCN-64: s_and_b64 exec, exec, [[LIVE]]
				; GCN: image_sample
				define amdgpu_ps <4 x float> @wqm_demote_1(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, i32 %idx, float %data, float %coord, float %coord2, float %z) {
				.entry:
				%z.cmp = fcmp olt float %z, 0.0
				br i1 %z.cmp, label %.continue, label %.demote

				.demote:
				call void @llvm.amdgcn.wqm.demote(i1 false)
				br label %.continue

				.continue:
				%tex = call <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32 15, float %coord, <8 x i32> %rsrc, <4 x i32> %sampler, i1 0, i32 0, i32 0) #0
				%tex0 = extractelement <4 x float> %tex, i32 0
				%tex1 = extractelement <4 x float> %tex, i32 0
				%coord1 = fadd float %tex0, %tex1
				%rtex = call <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32 15, float %coord1, <8 x i32> %rsrc, <4 x i32> %sampler, i1 0, i32 0, i32 0) #0

				ret <4 x float> %rtex
				}

				; GCN-LABEL: {{^}}wqm_demote_2:
				; GCN-NEXT: ; %.entry
				; GCN-32: s_mov_b32 [[ORIG:s[0-9]+]], exec_lo
				; GCN-64: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
				; GCN-32: s_wqm_b32 exec_lo, exec_lo
				; GCN-64: s_wqm_b64 exec, exec
				; GCN: image_sample
				; GCN: ; %.demote
				; GCN-32-NEXT: s_andn2_b32 [[LIVE:s[0-9]+]], [[ORIG]], exec
				; GCN-64-NEXT: s_andn2_b64 [[LIVE:s\[[0-9]+:[0-9]+\]]], [[ORIG]], exec
				; GCN: ; %.continue
				; GCN: v_add_f32_e32
				; GCN-32: s_and_b32 exec_lo, exec_lo, [[LIVE]]
				; GCN-64: s_and_b64 exec, exec, [[LIVE]]
				; GCN: image_sample
				define amdgpu_ps <4 x float> @wqm_demote_2(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, i32 %idx, float %data, float %coord, float %coord2, float %z) {
				.entry:
				%tex = call <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32 15, float %coord, <8 x i32> %rsrc, <4 x i32> %sampler, i1 0, i32 0, i32 0) #0
				%tex0 = extractelement <4 x float> %tex, i32 0
				%tex1 = extractelement <4 x float> %tex, i32 0
				%z.cmp = fcmp olt float %tex0, 0.0
				br i1 %z.cmp, label %.continue, label %.demote

				.demote:
				call void @llvm.amdgcn.wqm.demote(i1 false)
				br label %.continue

				.continue:
				%coord1 = fadd float %tex0, %tex1
				%rtex = call <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32 15, float %coord1, <8 x i32> %rsrc, <4 x i32> %sampler, i1 0, i32 0, i32 0) #0

				ret <4 x float> %rtex
				}


				; GCN-LABEL: {{^}}wqm_deriv:
				; GCN-NEXT: ; %.entry
				; GCN-32: s_mov_b32 [[ORIG:s[0-9]+]], exec_lo
				; GCN-64: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
				; GCN-32: s_wqm_b32 exec_lo, exec_lo
				; GCN-64: s_wqm_b64 exec, exec
				; GCN: ; %.demote0
				; GCN-32-NEXT: s_andn2_b32 [[LIVE:s[0-9]+]], [[ORIG]], exec
				; GCN-64-NEXT: s_andn2_b64 [[LIVE:s\[[0-9]+:[0-9]+\]]], [[ORIG]], exec
				; GCN: ; %.continue0
				; GCN: v_cndmask_b32_e64 [[DST:v[0-9]+]], 1.0, 0, [[LIVE]]
				; GCN-32: s_and_b32 exec_lo, exec_lo, [[LIVE]]
				; GCN-64: s_and_b64 exec, exec, [[LIVE]]
				; GCN: ; %.demote1
				; GCN-32-NEXT: s_mov_b32 exec_lo, 0
				; GCN-64-NEXT: s_mov_b64 exec, 0
				; GCN: ; %.continue1
				; GCN: exp mrt0
				define amdgpu_ps void @wqm_deriv(<2 x float> %input, float %arg, i32 %index) {
				.entry:
				%p0 = extractelement <2 x float> %input, i32 0
				%p1 = extractelement <2 x float> %input, i32 1
				%x0 = call float @llvm.amdgcn.interp.p1(float %p0, i32 immarg 0, i32 immarg 0, i32 %index) #2
				%x1 = call float @llvm.amdgcn.interp.p2(float %x0, float %p1, i32 immarg 0, i32 immarg 0, i32 %index) #2
				%argi = fptosi float %arg to i32
				%cond0 = icmp eq i32 %argi, 0
				br i1 %cond0, label %.continue0, label %.demote0

				.demote0:
				call void @llvm.amdgcn.wqm.demote(i1 false)
				br label %.continue0

				.continue0:
				%live = call i1 @llvm.amdgcn.wqm.helper()
				%live.cond = select i1 %live, i32 0, i32 1065353216
				%live.v0 = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %live.cond, i32 85, i32 15, i32 15, i1 true)
				%live.v0f = bitcast i32 %live.v0 to float
				%live.v1 = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %live.cond, i32 0, i32 15, i32 15, i1 true)
				%live.v1f = bitcast i32 %live.v1 to float
				%v0 = fsub float %live.v0f, %live.v1f
				%v0.wqm = call float @llvm.amdgcn.wqm.f32(float %v0)
				%cond1 = fcmp oeq float %v0.wqm, 0.000000e+00
				%cond2 = and i1 %live, %cond1
				br i1 %cond2, label %.continue1, label %.demote1

				.demote1:
				call void @llvm.amdgcn.wqm.demote(i1 false)
				br label %.continue1

				.continue1:
				call void @llvm.amdgcn.exp.compr.v2f16(i32 immarg 0, i32 immarg 15, <2 x half> <half 0xH3C00, half 0xH0000>, <2 x half> <half 0xH0000, half 0xH3C00>, i1 immarg true, i1 immarg true) #3
				ret void
				}

				; GCN-LABEL: {{^}}wqm_deriv_loop:
				; GCN-NEXT: ; %.entry
				; GCN-32: s_mov_b32 [[ORIG:s[0-9]+]], exec_lo
				; GCN-64: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
				; GCN-32: s_wqm_b32 exec_lo, exec_lo
				; GCN-64: s_wqm_b64 exec, exec
				; GCN: ; %.demote0
				; GCN-32-NEXT: s_andn2_b32 [[LIVE:s[0-9]+]], [[ORIG]], exec
				; GCN-64-NEXT: s_andn2_b64 [[LIVE:s\[[0-9]+:[0-9]+\]]], [[ORIG]], exec
				; GCN: ; %.continue0.preheader
				; GCN: ; %.continue0
				; GCN: v_cndmask_b32_e64 [[DST:v[0-9]+]], [[SRC:v[0-9]+]], 0, [[LIVE]]
				; GCN: ; %.demote1
				; GCN-32: s_andn2_b32 [[LIVE]], [[LIVE]], exec
				; GCN-64: s_andn2_b64 [[LIVE]], [[LIVE]], exec
				; GCN: ; %.return
				; GCN-32: s_and_b32 exec_lo, exec_lo, [[LIVE]]
				; GCN-64: s_and_b64 exec, exec, [[LIVE]]
				; GCN: exp mrt0
				define amdgpu_ps void @wqm_deriv_loop(<2 x float> %input, float %arg, i32 %index, i32 %limit) {
				.entry:
				  %p0 = extractelement <2 x float> %input, i32 0
				  %p1 = extractelement <2 x float> %input, i32 1
				  %x0 = call float @llvm.amdgcn.interp.p1(float %p0, i32 immarg 0, i32 immarg 0, i32 %index) #2
				  %x1 = call float @llvm.amdgcn.interp.p2(float %x0, float %p1, i32 immarg 0, i32 immarg 0, i32 %index) #2
				  %argi = fptosi float %arg to i32
				  %cond0 = icmp eq i32 %argi, 0
				  br i1 %cond0, label %.continue0, label %.demote0

				.demote0:
				  call void @llvm.amdgcn.wqm.demote(i1 false)
				  br label %.continue0

				.continue0:
				  %count = phi i32 [ 0, %.entry ], [ 0, %.demote0 ], [ %next, %.continue1 ]
				  %live = call i1 @llvm.amdgcn.wqm.helper()
				  %live.cond = select i1 %live, i32 0, i32 %count
				  %live.v0 = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %live.cond, i32 85, i32 15, i32 15, i1 true)
				  %live.v0f = bitcast i32 %live.v0 to float
				  %live.v1 = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %live.cond, i32 0, i32 15, i32 15, i1 true)
				  %live.v1f = bitcast i32 %live.v1 to float
				  %v0 = fsub float %live.v0f, %live.v1f
				  %v0.wqm = call float @llvm.amdgcn.wqm.f32(float %v0)
				  %cond1 = fcmp oeq float %v0.wqm, 0.000000e+00
				  %cond2 = and i1 %live, %cond1
				  br i1 %cond2, label %.continue1, label %.demote1

				.demote1:
				  call void @llvm.amdgcn.wqm.demote(i1 false)
				  br label %.continue1

				.continue1:
				  %next = add i32 %count, 1
				  %loop.cond = icmp slt i32 %next, %limit
				  br i1 %loop.cond, label %.continue0, label %.return

				.return:
				  call void @llvm.amdgcn.exp.compr.v2f16(i32 immarg 0, i32 immarg 15, <2 x half> <half 0xH3C00, half 0xH0000>, <2 x half> <half 0xH0000, half 0xH3C00>, i1 immarg true, i1 immarg true) #3
				  ret void
				}

				declare void @llvm.amdgcn.wqm.demote(i1) #0
				declare i1 @llvm.amdgcn.wqm.helper() #0
				declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #0
				declare <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
				declare float @llvm.amdgcn.wqm.f32(float) #1
				declare float @llvm.amdgcn.interp.p1(float, i32 immarg, i32 immarg, i32) #2
				declare float @llvm.amdgcn.interp.p2(float, float, i32 immarg, i32 immarg, i32) #2
				declare void @llvm.amdgcn.exp.compr.v2f16(i32 immarg, i32 immarg, <2 x half>, <2 x half>, i1 immarg, i1 immarg) #3
				declare i32 @llvm.amdgcn.mov.dpp.i32(i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #4

				attributes #0 = { nounwind }
				attributes #1 = { nounwind readnone }
				attributes #2 = { nounwind readnone speculatable }
				attributes #3 = { inaccessiblememonly nounwind }
				attributes #4 = { convergent nounwind readnone }

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic and live mask tracking
AbandonedPublic

Revision Contents

Diff 220868

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td

llvm/lib/Target/AMDGPU/SIInsertSkips.cpp

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

llvm/lib/Target/AMDGPU/SIInstructions.td

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll
