This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add an llvm.amdgcn.wqm intrinsic for WQM
ClosedPublic

Authored by cwabbott on Jul 7 2017, 6:43 PM.

Details

Reviewers

arsenm
tpr
nhaehnle

Commits

rG8c217d0a2959: [AMDGPU] Add an llvm.amdgcn.wqm intrinsic for WQM
rL310085: [AMDGPU] Add an llvm.amdgcn.wqm intrinsic for WQM

Summary

Previously, we assumed that certain types of instructions needed WQM in
pixel shaders, particularly DS instructions and image sampling
instructions. This was ok because with OpenGL, the assumption was
correct. But we want to start using DPP instructions for derivatives as
well as other things, so the assumption that we can infer whether to use
WQM based on the instruction won't continue to hold. This intrinsic lets
frontends like Mesa indicate what things need WQM based on their
knowledge of the API, rather than second-guessing them in the backend.
We need to keep around the old method of enabling WQM, but eventually we
should remove it once Mesa catches up. For now, this will let us use DPP
instructions for computing derivatives correctly.
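
To make the term concrete: in whole quad mode the execution mask is widened so that every quad (four consecutive lanes) with at least one live lane has all four lanes enabled; the extra lanes are the helper lanes. The following is a minimal C++ sketch of that mask widening, modeled on the effect of the `s_wqm_b64 exec, exec` instruction that the tests in this revision check for. `wholeQuadMask` and `wholeQuadMaskExamples` are hypothetical names for illustration only, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): widen a 64-lane execution mask so
// that every quad with at least one live lane has all four lanes enabled.
std::uint64_t wholeQuadMask(std::uint64_t exec) {
  std::uint64_t result = 0;
  for (unsigned quad = 0; quad < 16; ++quad) {
    const std::uint64_t quadBits = std::uint64_t{0xF} << (quad * 4);
    if (exec & quadBits)   // At least one lane of this quad is live.
      result |= quadBits;  // Enable the whole quad, helper lanes included.
  }
  return result;
}

// Usage sketch: the assertions document the masks this model produces.
void wholeQuadMaskExamples() {
  assert(wholeQuadMask(0x0) == 0x0);    // No live lanes, nothing enabled.
  assert(wholeQuadMask(0x1) == 0xF);    // One live lane enables its quad.
  assert(wholeQuadMask(0x21) == 0xFF);  // Lanes 0 and 5 enable quads 0 and 1.
}
```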

Diff Detail

Build Status

Buildable 8630
Build 8630: arc lint + arc unit

Event Timeline

cwabbott created this revision. · Jul 7 2017, 6:43 PM

Herald added subscribers: t-tye, dstuttard, yaxunl and 2 others. · Jul 7 2017, 6:43 PM

cwabbott mentioned this in D34677: [AMDGPU] Whole Quad Mode variant of mov.dpp intrinsic. · Jul 7 2017, 6:49 PM

Harbormaster completed remote builds in B8073: Diff 105732. · Jul 9 2017, 6:15 AM

Avoid illegal SGPR<->VGPR copies by treating WQM more like COPY, add test that
exposes the problem.

cwabbott added a child revision: D35523: [AMDGPU] refactor WQM pass in preparation for WWM (NFCI). · Jul 17 2017, 5:59 PM

tpr accepted this revision. · Jul 19 2017, 9:22 AM

This revision is now accepted and ready to land. · Jul 19 2017, 9:22 AM

Some minor comments. In addition, I think it can be simplified, and we probably want the intrinsic to be convergent, because sinking WQM computations into a non-uniform branch could mean that the computation becomes non-WQM for practical purposes.

include/llvm/IR/IntrinsicsAMDGPU.td
744–748	I believe this should be convergent, due to the way neighboring lanes may disappear due to control flow.
lib/Target/AMDGPU/SIInstructions.td
121	I believe this should have a let isConvergent = 1, due to the way neighboring lanes could "disappear" with additional control flow.
lib/Target/AMDGPU/SIWholeQuadMode.cpp
300	Capitalize the comment.
676–688	You can probably use MI->setDesc for this.

This revision now requires changes to proceed. · Jul 20 2017, 8:45 AM

cwabbott added inline comments. · Jul 20 2017, 1:14 PM

include/llvm/IR/IntrinsicsAMDGPU.td
744–748	Hmm, I'm not really convinced. All this intrinsic does is to guarantee something about how its source value is computed, which obviously won't change if the instruction itself is moved around. The operation itself is a simple move operation, which normally isn't convergent. Can you give me an example where adding a control dependency to the WQM intrinsic causes problems?

cwabbott added inline comments. · Jul 26 2017, 3:28 PM

lib/Target/AMDGPU/SIWholeQuadMode.cpp
676–688	It's not quite that simple, since I'm also using this code to optimize llvm.amdgcn.set.inactive with an undef second argument, in which case we need to get rid of the second (undef) argument. But I think the end-result is still a little shorter and otherwise equivalent, so I'll change it.

Minor comment style fixes, simplify lowerCopyInstrs by using setDesc()
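
As a rough illustration of the `setDesc()` simplification: the pass queues each WQM pseudo while scanning and later rewrites its opcode in place, so the destination and source operands carry over untouched and no new instruction has to be built. The sketch below is a toy C++ model of that idea. `Opcode`, `Instr`, and this `lowerCopyInstrs` are hypothetical stand-ins for `MachineInstr`, `MI->setDesc(TII->get(AMDGPU::COPY))`, and the pass's `LowerToCopyInstrs` queue. The `llvm.amdgcn.set.inactive` case mentioned above, where an operand has to be dropped, is not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for the machine IR; none of these are LLVM types.
enum class Opcode { COPY, WQM, V_ADD_F32 };

struct Instr {
  Opcode opcode;
  std::string dst;
  std::vector<std::string> srcs;
};

// In-place lowering: only the opcode changes, which is the effect the
// review attributes to MI->setDesc(). Operands are left exactly as they were.
void lowerCopyInstrs(const std::vector<Instr*>& lowerToCopy) {
  for (Instr* mi : lowerToCopy) {
    assert(mi->srcs.size() == 1 && "this model expects a single source");
    mi->opcode = Opcode::COPY;
  }
}
```

Because the instruction object is reused, anything that already points at it (in the real pass, for example, the live-interval maps) does not need to be updated, which is presumably why the rewritten function is shorter.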

cwabbott marked 2 inline comments as done. · Jul 26 2017, 3:33 PM

Thanks. Assuming we agree on the derivative calculations logic I added in a comment, this LGTM.

include/llvm/IR/IntrinsicsAMDGPU.td
744–748	How about derivative calculations where the result is only used in a subsequent if-block? Basically, we need to prevent

    %deriv.0 = derivative calculations
    %deriv = llvm.amdgcn.wqm(%deriv.0)
    if (cc) {
      only_use_of(%deriv)
    }

being sunk into

    if (cc) {
      %deriv.0 = derivative calculations
      %deriv = llvm.amdgcn.wqm(%deriv.0)
      only_use_of(%deriv)
    }

Although, on second thought, I guess all the cross-lane operations involved in computing %deriv.0 are already convergent? So I guess it's fine in the end...

This revision is now accepted and ready to land. · Aug 2 2017, 2:15 AM

cwabbott added inline comments. · Aug 2 2017, 12:16 PM

include/llvm/IR/IntrinsicsAMDGPU.td
744–748	Yes, in this case we should still be fine. The way I think about it is that llvm.amdgcn.wqm merely guarantees that its source has its helper lanes computed correctly; making sure the correct helper lanes are enabled when computing the source is up to the source computations themselves. So, it seems like it should be up to the uses of llvm.amdgcn.wqm to be marked as convergent if necessary.
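
The distinction drawn in this thread can be illustrated with a toy model of a single quad. In the sketch below, the value feeding a derivative is only meaningful in lanes that were enabled when it was computed, and the derivative itself is a cross-lane read. Everything here (`Quad`, `computeSource`, `ddx`, the choice of `2 * x` as the source computation, and stale registers modeled as `0.0f`) is a hypothetical simplification, not how the hardware or the backend represents any of this.

```cpp
#include <array>
#include <cassert>

using Quad = std::array<float, 4>;  // One value per lane of a 2x2 quad.

// Each enabled lane computes 2 * x for its pixel. Disabled lanes never run,
// so their register keeps a stale value (modeled here as 0.0f).
Quad computeSource(const Quad& x, unsigned execMask) {
  Quad v{};
  for (unsigned lane = 0; lane < 4; ++lane)
    if (execMask & (1u << lane))
      v[lane] = 2.0f * x[lane];
  return v;
}

// Horizontal derivative as a cross-lane read: the right neighbor in the same
// row minus the left neighbor. This is the operation that depends on which
// lanes were enabled when the source was computed.
float ddx(const Quad& v, unsigned lane) {
  return v[lane | 1u] - v[lane & ~1u];
}

// Usage sketch: only lane 0 is a live pixel.
void derivativeExample() {
  const Quad x{3.0f, 4.0f, 3.0f, 4.0f};
  const Quad inWQM = computeSource(x, 0xFu);    // Helper lanes enabled.
  const Quad inExact = computeSource(x, 0x1u);  // Only the live lane runs.
  assert(ddx(inWQM, 0) == 2.0f);     // Neighbor value is valid.
  assert(ddx(inExact, 0) == -6.0f);  // Neighbor was never computed.
}
```

In this model the plain copy of the result is the same wherever it is placed; what changes the answer is whether the neighboring lanes ran the source computation, which matches the argument that the cross-lane operations, not the WQM copy, are the ones that need to be convergent.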

Closed by commit rL310085: [AMDGPU] Add an llvm.amdgcn.wqm intrinsic for WQM (authored by cwabbott). · Aug 4 2017, 11:37 AM

This revision was automatically updated to reflect the committed changes.

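Revision Contents

Path	Size
include/llvm/IR/IntrinsicsAMDGPU.td	7 lines
lib/Target/AMDGPU/SIFixSGPRCopies.cpp	6 lines
lib/Target/AMDGPU/SIISelLowering.cpp	5 lines
lib/Target/AMDGPU/SIInstrInfo.cpp	2 lines
lib/Target/AMDGPU/SIInstructions.td	5 lines
lib/Target/AMDGPU/SIWholeQuadMode.cpp	15 lines
test/CodeGen/AMDGPU/wqm.ll	36 lines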

Diff 108372

include/llvm/IR/IntrinsicsAMDGPU.td

@@ 734 lines hidden @@ def int_amdgcn_alignbit : Intrinsic<[llvm_i32_ty],
   [IntrNoMem, IntrSpeculatable]
 >;

 def int_amdgcn_alignbyte : Intrinsic<[llvm_i32_ty],
   [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
   [IntrNoMem, IntrSpeculatable]
 >;

+
+// Copies the source value to the destination value, with the guarantee that
+// the source value is computed as if the entire program were executed in WQM.
+def int_amdgcn_wqm : Intrinsic<[llvm_any_ty],
+  [LLVMMatchType<0>], [IntrNoMem, IntrSpeculatable]
+>;
+
 //===----------------------------------------------------------------------===//
 // CI+ Intrinsics
 //===----------------------------------------------------------------------===//

 def int_amdgcn_s_dcache_inv_vol :
   GCCBuiltin<"__builtin_amdgcn_s_dcache_inv_vol">,
   Intrinsic<[], [], []>;

@@ 77 lines hidden @@

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

@@ 332 lines hidden @@
 }

 static bool isSafeToFoldImmIntoCopy(const MachineInstr *Copy,
                                     const MachineInstr *MoveImm,
                                     const SIInstrInfo *TII,
                                     unsigned &SMovOp,
                                     int64_t &Imm) {

+  if (Copy->getOpcode() != AMDGPU::COPY)
+    return false;
+
   if (!MoveImm->isMoveImmediate())
     return false;

   const MachineOperand *ImmOp =
       TII->getNamedOperand(*MoveImm, AMDGPU::OpName::src0);
   if (!ImmOp->isImm())
     return false;

@@ 210 lines hidden @@ for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();
     MachineBasicBlock &MBB = *BI;
     for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
          I != E; ++I) {
       MachineInstr &MI = *I;

       switch (MI.getOpcode()) {
       default:
         continue;
-      case AMDGPU::COPY: {
+      case AMDGPU::COPY:
+      case AMDGPU::WQM: {
         // If the destination register is a physical register there isn't really
         // much we can do to fix this.
         if (!TargetRegisterInfo::isVirtualRegister(MI.getOperand(0).getReg()))
           continue;

         const TargetRegisterClass *SrcRC, *DstRC;
         std::tie(SrcRC, DstRC) = getCopyRegClasses(MI, *TRI, MRI);
         if (isVGPRToSGPRCopy(SrcRC, DstRC, *TRI)) {
@@ 116 lines hidden @@

lib/Target/AMDGPU/SIISelLowering.cpp

@@ 3,284 lines hidden @@ return DAG.getNode(AMDGPUISD::BFE_U32, DL, VT,
                        Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
   case Intrinsic::amdgcn_cvt_pkrtz: {
     // FIXME: Stop adding cast if v2f16 legal.
     EVT VT = Op.getValueType();
     SDValue Node = DAG.getNode(AMDGPUISD::CVT_PKRTZ_F16_F32, DL, MVT::i32,
                                Op.getOperand(1), Op.getOperand(2));
     return DAG.getNode(ISD::BITCAST, DL, VT, Node);
   }
+  case Intrinsic::amdgcn_wqm: {
+    SDValue Src = Op.getOperand(1);
+    return SDValue(DAG.getMachineNode(AMDGPU::WQM, DL, Src.getValueType(), Src),
+                   0);
+  }
   default:
     return Op;
   }
 }

 SDValue SITargetLowering::LowerINTRINSIC_W_CHAIN(SDValue Op,
                                                  SelectionDAG &DAG) const {
   unsigned IntrID = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();
@@ 2,443 lines hidden @@

lib/Target/AMDGPU/SIInstrInfo.cpp

@@ 2,647 lines hidden @@

 unsigned SIInstrInfo::getVALUOp(const MachineInstr &MI) {
   switch (MI.getOpcode()) {
   default: return AMDGPU::INSTRUCTION_LIST_END;
   case AMDGPU::REG_SEQUENCE: return AMDGPU::REG_SEQUENCE;
   case AMDGPU::COPY: return AMDGPU::COPY;
   case AMDGPU::PHI: return AMDGPU::PHI;
   case AMDGPU::INSERT_SUBREG: return AMDGPU::INSERT_SUBREG;
+  case AMDGPU::WQM: return AMDGPU::WQM;
   case AMDGPU::S_MOV_B32:
     return MI.getOperand(1).isReg() ?
            AMDGPU::COPY : AMDGPU::V_MOV_B32_e32;
   case AMDGPU::S_ADD_I32:
   case AMDGPU::S_ADD_U32: return AMDGPU::V_ADD_I32_e32;
   case AMDGPU::S_ADDC_U32: return AMDGPU::V_ADDC_U32_e32;
   case AMDGPU::S_SUB_I32:
   case AMDGPU::S_SUB_U32: return AMDGPU::V_SUB_I32_e32;
@@ 1,288 lines hidden @@ const TargetRegisterClass *SIInstrInfo::getDestEquivalentVGPRClass(
   switch (Inst.getOpcode()) {
   // For target instructions, getOpRegClass just returns the virtual register
   // class associated with the operand, so we need to find an equivalent VGPR
   // register class in order to move the instruction to the VALU.
   case AMDGPU::COPY:
   case AMDGPU::PHI:
   case AMDGPU::REG_SEQUENCE:
   case AMDGPU::INSERT_SUBREG:
+  case AMDGPU::WQM:
     if (RI.hasVGPRs(NewDstRC))
       return nullptr;

     NewDstRC = RI.getEquivalentVGPRClass(NewDstRC);
     if (!NewDstRC)
       return nullptr;
     return NewDstRC;
   default:
@@ 390 lines hidden @@

lib/Target/AMDGPU/SIInstructions.td

@@ 109 lines hidden @@ def V_CNDMASK_B64_PSEUDO : VOP3Common <(outs VReg_64:$vdst),
   let isCodeGenOnly = 1;
   let usesCustomInserter = 1;
 }

 // 64-bit vector move instruction. This is mainly used by the SIFoldOperands
 // pass to enable folding of inline immediates.
 def V_MOV_B64_PSEUDO : VPseudoInstSI <(outs VReg_64:$vdst),
                                       (ins VSrc_b64:$src0)>;

+// Pseudoinstruction for @llvm.amdgcn.wqm. It is turned into a copy
+// after the WQM pass processes them.
+def WQM : PseudoInstSI <(outs unknown:$vdst), (ins unknown:$src0)>;
+
 } // End let hasSideEffects = 0, mayLoad = 0, mayStore = 0, Uses = [EXEC]

 let usesCustomInserter = 1, SALU = 1 in {
 def GET_GROUPSTATICSIZE : PseudoInstSI <(outs SReg_32:$sdst), (ins),
   [(set SReg_32:$sdst, (int_amdgcn_groupstaticsize))]>;
 } // End let usesCustomInserter = 1, SALU = 1

 def S_MOV_B64_term : PseudoInstSI<(outs SReg_64:$dst),
@@ 1,182 lines hidden @@

lib/Target/AMDGPU/SIWholeQuadMode.cpp

@@ 130 lines hidden @@ private:
   const SIInstrInfo *TII;
   const SIRegisterInfo *TRI;
   MachineRegisterInfo *MRI;
   LiveIntervals *LIS;

   DenseMap<const MachineInstr *, InstrInfo> Instructions;
   DenseMap<MachineBasicBlock *, BlockInfo> Blocks;
   SmallVector<MachineInstr *, 1> LiveMaskQueries;
+  SmallVector<MachineInstr *, 4> LowerToCopyInstrs;

   void printInfo();

   void markInstruction(MachineInstr &MI, char Flag,
                        std::vector<WorkItem> &Worklist);
   void markUsesWQM(const MachineInstr &MI, std::vector<WorkItem> &Worklist);
   char scanInstructions(MachineFunction &MF, std::vector<WorkItem> &Worklist);
   void propagateInstruction(MachineInstr &MI, std::vector<WorkItem> &Worklist);
@@ 10 lines hidden @@ prepareInsertion(MachineBasicBlock &MBB, MachineBasicBlock::iterator First,
                    bool SaveSCC);
   void toExact(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
                unsigned SaveWQM, unsigned LiveMaskReg);
   void toWQM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
              unsigned SavedWQM);
   void processBlock(MachineBasicBlock &MBB, unsigned LiveMaskReg, bool isEntry);

   void lowerLiveMaskQueries(unsigned LiveMaskReg);
+  void lowerCopyInstrs();

 public:
   static char ID;

   SIWholeQuadMode() :
     MachineFunctionPass(ID) { }

   bool runOnMachineFunction(MachineFunction &MF) override;
@@ 116 lines hidden @@ for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
         Flags = StateWQM;
       } else if (TII->isWQM(Opcode)) {
         // Sampling instructions don't need to produce results for all pixels
         // in a quad, they just require all inputs of a quad to have been
         // computed for derivatives.
         markUsesWQM(MI, Worklist);
         GlobalFlags |= StateWQM;
         continue;
+      } else if (Opcode == AMDGPU::WQM) {
+        // The WQM intrinsic requires its output to have all the helper lanes
+        // correct, so we need it to be in WQM.
+        Flags = StateWQM;
+        LowerToCopyInstrs.push_back(&MI);
       } else if (TII->isDisableWQM(MI)) {
         Flags = StateExact;
       } else {
         if (Opcode == AMDGPU::SI_PS_LIVE) {
           LiveMaskQueries.push_back(&MI);
         } else if (WQMOutputs) {
           // The function is in machine SSA form, which means that physical
           // VGPRs correspond to shader inputs and outputs. Inputs are
@@ 356 lines hidden @@ MachineInstr *Copy =
         BuildMI(*MI->getParent(), MI, DL, TII->get(AMDGPU::COPY), Dest)
             .addReg(LiveMaskReg);

     LIS->ReplaceMachineInstrInMaps(MI, Copy);
     MI->eraseFromParent();
   }
 }

+void SIWholeQuadMode::lowerCopyInstrs() {
+  for (MachineInstr *MI : LowerToCopyInstrs)
+    MI->setDesc(TII->get(AMDGPU::COPY));
+}
+
 bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {
   if (MF.getFunction()->getCallingConv() != CallingConv::AMDGPU_PS)
     return false;

   Instructions.clear();
   Blocks.clear();
   LiveMaskQueries.clear();
+  LowerToCopyInstrs.clear();

   const SISubtarget &ST = MF.getSubtarget<SISubtarget>();

   TII = ST.getInstrInfo();
   TRI = &TII->getRegisterInfo();
   MRI = &MF.getRegInfo();
   LIS = &getAnalysis<LiveIntervals>();

@@ 19 lines hidden @@ unsigned LiveMaskReg = 0;

     if (GlobalFlags == StateWQM) {
       // For a shader that needs only WQM, we can just set it once.
       BuildMI(Entry, EntryMI, DebugLoc(), TII->get(AMDGPU::S_WQM_B64),
               AMDGPU::EXEC)
           .addReg(AMDGPU::EXEC);

       lowerLiveMaskQueries(LiveMaskReg);
+      lowerCopyInstrs();
       // EntryMI may become invalid here
       return true;
     }
   }

   DEBUG(printInfo());

   lowerLiveMaskQueries(LiveMaskReg);
+  lowerCopyInstrs();

   // Handle the general case
   for (auto BII : Blocks)
     processBlock(*BII.first, LiveMaskReg, BII.first == &*MF.begin());

   // Physical registers like SCC aren't tracked by default anyway, so just
   // removing the ranges we computed is the simplest option for maintaining
   // the analysis results.
   LIS->removeRegUnit(*MCRegUnitIterator(AMDGPU::SCC, TRI));

   return true;
 }
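
The scan step in this file flags the WQM pseudo with StateWQM, and the intrinsic's documentation says its source is computed as if the whole program ran in WQM. A simplified way to picture the consequence is a backward walk from each WQM pseudo over the instructions that feed it. The following C++ sketch models only that walk on a straight-line program; `ToyInstr` and `markWQM` are hypothetical names, and the real pass additionally tracks per-block state, Exact-mode requirements, and live-mask queries that are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction: a name plus the indices of the instructions that define
// its operands. Not an LLVM type.
struct ToyInstr {
  std::string name;
  std::vector<std::size_t> operands;
  bool isWQMPseudo;
  bool needsWQM;
};

// Seed a worklist with every WQM pseudo, then mark everything that
// (transitively) feeds one as needing whole quad mode.
void markWQM(std::vector<ToyInstr>& program) {
  std::vector<std::size_t> worklist;
  for (std::size_t i = 0; i < program.size(); ++i) {
    if (program[i].isWQMPseudo) {
      program[i].needsWQM = true;
      worklist.push_back(i);
    }
  }
  while (!worklist.empty()) {
    const std::size_t current = worklist.back();
    worklist.pop_back();
    for (std::size_t def : program[current].operands) {
      assert(def < program.size() && "operand index out of range");
      if (!program[def].needsWQM) {
        program[def].needsWQM = true;
        worklist.push_back(def);
      }
    }
  }
}
```

Applied to a program shaped like `test5` in wqm.ll (two loads, an `fadd`, then the intrinsic), this model marks all four instructions, while an instruction that does not feed the intrinsic stays unmarked.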

test/CodeGen/AMDGPU/wqm.ll

@@ 68 lines hidden @@ main_body:

   call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> undef, <4 x i32> undef, i32 %c.1, i32 0, i1 0, i1 0)
   %c.1.bc = bitcast i32 %c.1 to float
   %tex = call <4 x float> @llvm.amdgcn.image.sample.v4f32.f32.v8i32(float %c.1.bc, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i1 false, i1 false, i1 false, i1 false, i1 false) #0
   %dtex = call <4 x float> @llvm.amdgcn.image.sample.v4f32.v4f32.v8i32(<4 x float> %tex, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i1 false, i1 false, i1 false, i1 false, i1 false) #0
   ret <4 x float> %dtex
 }

+; Check that WQM is triggered by the wqm intrinsic.
+;
+;CHECK-LABEL: {{^}}test5:
+;CHECK: s_wqm_b64 exec, exec
+;CHECK: buffer_load_dword
+;CHECK: buffer_load_dword
+;CHECK: v_add_f32_e32
+define amdgpu_ps float @test5(i32 inreg %idx0, i32 inreg %idx1) {
+main_body:
+  %src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
+  %src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
+  %out = fadd float %src0, %src1
+  %out.0 = call float @llvm.amdgcn.wqm.f32(float %out)
+  ret float %out.0
+}
+
+; Check that the wqm intrinsic works correctly for integers.
+;
+;CHECK-LABEL: {{^}}test6:
+;CHECK: s_wqm_b64 exec, exec
+;CHECK: buffer_load_dword
+;CHECK: buffer_load_dword
+;CHECK: v_add_f32_e32
+define amdgpu_ps float @test6(i32 inreg %idx0, i32 inreg %idx1) {
+main_body:
+  %src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
+  %src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
+  %out = fadd float %src0, %src1
+  %out.0 = bitcast float %out to i32
+  %out.1 = call i32 @llvm.amdgcn.wqm.i32(i32 %out.0)
+  %out.2 = bitcast i32 %out.1 to float
+  ret float %out.2
+}
+
 ; Check a case of one branch of an if-else requiring WQM, the other requiring
 ; exact.
 ;
 ; Note: In this particular case, the save-and-restore could be avoided if the
 ; analysis understood that the two branches of the if-else are mutually
 ; exclusive.
 ;
 ;CHECK-LABEL: {{^}}test_control_flow_0:
@@ 404 lines hidden @@
 declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1) #2
 declare void @llvm.amdgcn.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i1, i1) #2
 declare <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #3
 declare float @llvm.amdgcn.buffer.load.f32(<4 x i32>, i32, i32, i1, i1) #3
 declare <4 x float> @llvm.amdgcn.image.sample.v4f32.f32.v8i32(float, <8 x i32>, <4 x i32>, i32, i1, i1, i1, i1, i1) #3
 declare <4 x float> @llvm.amdgcn.image.sample.v4f32.v2f32.v8i32(<2 x float>, <8 x i32>, <4 x i32>, i32, i1, i1, i1, i1, i1) #3
 declare <4 x float> @llvm.amdgcn.image.sample.v4f32.v4f32.v8i32(<4 x float>, <8 x i32>, <4 x i32>, i32, i1, i1, i1, i1, i1) #3
 declare void @llvm.AMDGPU.kill(float) #1
+declare float @llvm.amdgcn.wqm.f32(float) #3
+declare i32 @llvm.amdgcn.wqm.i32(i32) #3

 attributes #1 = { nounwind }
 attributes #2 = { nounwind readonly }
 attributes #3 = { nounwind readnone }
 attributes #4 = { "amdgpu-ps-wqm-outputs" }