This is an archive of the discontinued LLVM Phabricator instance.

Differential D64935

[AMDGPU] Add llvm.amdgcn.softwqm intrinsic
ClosedPublic

Authored by critson on Jul 18 2019, 10:05 AM.

Details

Reviewers

nhaehnle
tpr

Commits

rG00e89b428b99: [AMDGPU] Add llvm.amdgcn.softwqm intrinsic
rL367097: [AMDGPU] Add llvm.amdgcn.softwqm intrinsic

Summary

Add llvm.amdgcn.softwqm intrinsic which behaves like llvm.amdgcn.wqm only if there is other WQM computation in the shader.

Event Timeline

critson created this revision. Jul 18 2019, 10:05 AM

Herald added a project: Restricted Project. Jul 18 2019, 10:05 AM

Herald added subscribers: llvm-commits, t-tye, dstuttard and 5 others.

arsenm added inline comments. Jul 18 2019, 10:16 AM

lib/Target/AMDGPU/SIISelLowering.cpp
5955–5959	Is there some reason you can't just handle this with an instruction pattern?

critson marked an inline comment as done. Jul 19 2019, 3:25 AM

critson added inline comments.

lib/Target/AMDGPU/SIISelLowering.cpp
5955–5959	For the same reason as llvm.amdgcn.wqm, we don't specify the input and output types. Happy to be corrected, but I don't think there is a way to have a single instruction pattern covering all types.

Have you checked that this actually fixes the reported CTS failure?

IIRC the CTS failure was essentially due to a shader of the form:

derivative calculation here
subgroup operation here

The derivative calculation enables WQM, but then we may leave WQM again for the subgroup operations which is unexpected (since helper lanes are expected to participate). So softwqm needs to seed WQM requirements, but only if there is at least one hard wqm requirement in the shader.

arsenm added inline comments. Jul 19 2019, 6:43 AM

lib/Target/AMDGPU/SIISelLowering.cpp
5955–5959	It's easier to directly select than to enumerate all the possible types. I would still expect all of these direct-to-machine-node intrinsics to be handled in AMDGPUISelDAGToDAG
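
The exchange above turns on the fact that the pseudo-instruction takes operands of any type, so a typed instruction pattern would have to be repeated for each type, while direct selection needs only one code path. The following is a minimal, self-contained sketch of that idea. It does not use the LLVM API: `Intrin`, `Pseudo`, `ValueType`, `MachineNode`, `pseudoFor` and `selectDirect` are all hypothetical names invented for illustration.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration only; these are not LLVM types.
enum class Intrin { Wqm, SoftWqm, Wwm, Other };
enum class Pseudo { WQM, SOFT_WQM, WWM };

// Toy value type; in a real selection DAG this would be a machine value type.
using ValueType = std::string;

struct MachineNode {
  Pseudo Opcode;
  ValueType Type; // result type, copied from the source operand
};

// Map an intrinsic to its pseudo opcode without inspecting any operand type.
std::optional<Pseudo> pseudoFor(Intrin ID) {
  switch (ID) {
  case Intrin::Wqm:     return Pseudo::WQM;
  case Intrin::SoftWqm: return Pseudo::SOFT_WQM;
  case Intrin::Wwm:     return Pseudo::WWM;
  case Intrin::Other:   return std::nullopt;
  }
  return std::nullopt;
}

// Direct selection: a single code path covers every type, because the new
// node simply inherits the type of its source operand.
std::optional<MachineNode> selectDirect(Intrin ID, const ValueType &SrcType) {
  std::optional<Pseudo> Opc = pseudoFor(ID);
  if (!Opc)
    return std::nullopt; // not one of the copy-like intrinsics
  return MachineNode{*Opc, SrcType};
}
```

In this toy model, calling `selectDirect(Intrin::SoftWqm, "f32")` and `selectDirect(Intrin::SoftWqm, "i32")` goes through the same code and differs only in the type carried along, which is the property a per-type pattern list would have to spell out case by case.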

Add missing code in SI Fix SGPR copies.

Harbormaster completed remote builds in B35376: Diff 210865. Jul 19 2019, 11:06 AM

In D64935#1593333, @nhaehnle wrote:

Have you checked that this actually fixes the reported CTS failure?

Yes, with the associated (minimal) frontend changes this fixes the CTS failure.

In D64935#1593333, @nhaehnle wrote:

The derivative calculation enables WQM, but then we may leave WQM again for the subgroup operations which is unexpected (since helper lanes are expected to participate). So softwqm needs to seed WQM requirements, but only if there is at least one hard wqm requirement in the shader.

While my understanding of "seed requirements" means "for the whole shader", this code does what you expect.
If there are any hard WQM requirements for the shader, then all softwqm instructions (and their dependencies) are marked WQM.
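
The rule described in this reply can be sketched as a small standalone model. This is a simplification under stated assumptions, not the SIWholeQuadMode implementation: `Kind`, `Instr` and `markWQM` are hypothetical names, a program is a flat list of instructions whose operands refer to other instructions by index, and "dependencies" are modelled as the transitive operands of a marked instruction.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy model; none of these names exist in LLVM.
enum class Kind { Plain, HardWQM, SoftWQM };

struct Instr {
  Kind K;
  std::vector<std::size_t> Operands; // indices of the instructions it reads
};

// Returns one flag per instruction: true if it must be computed in WQM.
std::vector<bool> markWQM(const std::vector<Instr> &Prog) {
  std::vector<bool> NeedsWQM(Prog.size(), false);
  std::vector<std::size_t> Worklist;
  std::vector<std::size_t> Soft;
  bool AnyHard = false;

  // Scan: hard requirements seed the worklist at once, soft ones are only
  // remembered for later.
  for (std::size_t I = 0; I < Prog.size(); ++I) {
    if (Prog[I].K == Kind::HardWQM) {
      AnyHard = true;
      Worklist.push_back(I);
    } else if (Prog[I].K == Kind::SoftWQM) {
      Soft.push_back(I);
    }
  }

  // Deferred decision: soft requests become seeds only if at least one hard
  // requirement was seen anywhere in the program.
  if (AnyHard)
    Worklist.insert(Worklist.end(), Soft.begin(), Soft.end());

  // Propagate each requirement backwards through the operands.
  while (!Worklist.empty()) {
    std::size_t I = Worklist.back();
    Worklist.pop_back();
    if (NeedsWQM[I])
      continue; // already marked
    NeedsWQM[I] = true;
    for (std::size_t Op : Prog[I].Operands)
      Worklist.push_back(Op);
  }
  return NeedsWQM;
}
```

With a program that contains only a soft request, nothing is marked. Adding a single hard requirement anywhere causes the soft request and everything it reads to be marked as well, while unrelated instructions stay unmarked.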

Okay thanks, I see the logic now.

lib/Target/AMDGPU/SIISelLowering.cpp
5955–5959	You mean adding an `AMDGPUDAGToDAGISel::SelectINTRINSIC_WO_CHAIN` and lowering the softwqm intrinsic there? That does make sense to me.
lib/Target/AMDGPU/SIInstructions.td
114	s/wcm/wqm/

Move opcode selection to AMDGPUISelDAGToDAG.
Fix typo in comment.

I've moved the selection to AMDGPUISelDAGToDAG.
If this code is appropriate I will submit a follow-up change to move the selection for llvm.amdgcn.wqm and llvm.amdgcn.wwm as well.

Harbormaster completed remote builds in B35461: Diff 211099. Jul 22 2019, 7:45 AM

LGTM. Followup for WQM/WWM sounds good to me as well.

This revision is now accepted and ready to land. Jul 24 2019, 1:27 AM

Closed by commit rL367097: [AMDGPU] Add llvm.amdgcn.softwqm intrinsic (authored by critson). Jul 26 2019, 2:56 AM

This revision was automatically updated to reflect the committed changes.

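Revision Contents

Path	Size
include/llvm/IR/IntrinsicsAMDGPU.td	7 lines
lib/Target/AMDGPU/SIISelLowering.cpp	5 lines
lib/Target/AMDGPU/SIInstrInfo.cpp	3 lines
lib/Target/AMDGPU/SIInstructions.td	4 lines
lib/Target/AMDGPU/SIWholeQuadMode.cpp	10 lines
test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll	188 lines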

Diff 210619

include/llvm/IR/IntrinsicsAMDGPU.td

[...]
  // Copies the source value to the destination value, with the guarantee that
  // the source value is computed as if the entire program were executed in WQM.
  def int_amdgcn_wqm : Intrinsic<[llvm_any_ty],
    [LLVMMatchType<0>], [IntrNoMem, IntrSpeculatable]
  >;

+ // Copies the source value to the destination value, such that the source
+ // is computed as if the entire program were executed in WQM if any other
+ // program code executes in WQM.
+ def int_amdgcn_softwqm : Intrinsic<[llvm_any_ty],
+   [LLVMMatchType<0>], [IntrNoMem, IntrSpeculatable]
+ >;
+
  // Return true if at least one thread within the pixel quad passes true into
  // the function.
  def int_amdgcn_wqm_vote : Intrinsic<[llvm_i1_ty],
    [llvm_i1_ty], [IntrNoMem, IntrConvergent]
  >;

  // If false, set EXEC=0 for the current thread until the end of program.
  def int_amdgcn_kill : Intrinsic<[], [llvm_i1_ty], []>;
[...]

lib/Target/AMDGPU/SIISelLowering.cpp

[...] SDValue Node = DAG.getNode(Opcode, DL, MVT::i32,
                                 Op.getOperand(1), Op.getOperand(2));
      return DAG.getNode(ISD::BITCAST, DL, VT, Node);
    }
    case Intrinsic::amdgcn_wqm: {
      SDValue Src = Op.getOperand(1);
      return SDValue(DAG.getMachineNode(AMDGPU::WQM, DL, Src.getValueType(), Src),
                     0);
    }
+   case Intrinsic::amdgcn_softwqm: {
+     SDValue Src = Op.getOperand(1);
+     return SDValue(DAG.getMachineNode(AMDGPU::SOFT_WQM, DL, Src.getValueType(), Src),
+                    0);
+   }
    case Intrinsic::amdgcn_wwm: {
      SDValue Src = Op.getOperand(1);
      return SDValue(DAG.getMachineNode(AMDGPU::WWM, DL, Src.getValueType(), Src),
                     0);
    }
    case Intrinsic::amdgcn_fmad_ftz:
      return DAG.getNode(AMDGPUISD::FMAD_FTZ, DL, VT, Op.getOperand(1),
                         Op.getOperand(2), Op.getOperand(3));
[...]

lib/Target/AMDGPU/SIInstrInfo.cpp

[...]
  unsigned SIInstrInfo::getVALUOp(const MachineInstr &MI) const {
    switch (MI.getOpcode()) {
    default: return AMDGPU::INSTRUCTION_LIST_END;
    case AMDGPU::REG_SEQUENCE: return AMDGPU::REG_SEQUENCE;
    case AMDGPU::COPY: return AMDGPU::COPY;
    case AMDGPU::PHI: return AMDGPU::PHI;
    case AMDGPU::INSERT_SUBREG: return AMDGPU::INSERT_SUBREG;
    case AMDGPU::WQM: return AMDGPU::WQM;
+   case AMDGPU::SOFT_WQM: return AMDGPU::SOFT_WQM;
    case AMDGPU::WWM: return AMDGPU::WWM;
    case AMDGPU::S_MOV_B32: {
      const MachineRegisterInfo &MRI = MI.getParent()->getParent()->getRegInfo();
      return MI.getOperand(1).isReg() ||
             RI.isAGPR(MRI, MI.getOperand(0).getReg()) ?
             AMDGPU::COPY : AMDGPU::V_MOV_B32_e32;
    }
    case AMDGPU::S_ADD_I32:
[...] for (MachineRegisterInfo::use_iterator I = MRI.use_begin(DstReg),
           E = MRI.use_end(); I != E;) {
      MachineInstr &UseMI = *I->getParent();

      unsigned OpNo = 0;

      switch (UseMI.getOpcode()) {
      case AMDGPU::COPY:
      case AMDGPU::WQM:
+     case AMDGPU::SOFT_WQM:
      case AMDGPU::WWM:
      case AMDGPU::REG_SEQUENCE:
      case AMDGPU::PHI:
      case AMDGPU::INSERT_SUBREG:
        break;
      default:
        OpNo = I.getOperandNo();
        break;
[...] const TargetRegisterClass *SIInstrInfo::getDestEquivalentVGPRClass(
    // For target instructions, getOpRegClass just returns the virtual register
    // class associated with the operand, so we need to find an equivalent VGPR
    // register class in order to move the instruction to the VALU.
    case AMDGPU::COPY:
    case AMDGPU::PHI:
    case AMDGPU::REG_SEQUENCE:
    case AMDGPU::INSERT_SUBREG:
    case AMDGPU::WQM:
+   case AMDGPU::SOFT_WQM:
    case AMDGPU::WWM: {
      const TargetRegisterClass *SrcRC = getOpRegClass(Inst, 1);
      if (RI.hasAGPRs(SrcRC)) {
        if (RI.hasAGPRs(NewDstRC))
          return nullptr;

        NewDstRC = RI.getEquivalentAGPRClass(NewDstRC);
        if (!NewDstRC)
[...]

lib/Target/AMDGPU/SIInstructions.td

[...]
  // SIFoldOperands pass to enable folding of inline immediates.
  def V_MOV_B64_PSEUDO : VPseudoInstSI <(outs VReg_64:$vdst),
                                        (ins VSrc_b64:$src0)>;

  // Pseudoinstruction for @llvm.amdgcn.wqm. It is turned into a copy after the
  // WQM pass processes it.
  def WQM : PseudoInstSI <(outs unknown:$vdst), (ins unknown:$src0)>;

+ // Pseudoinstruction for @llvm.amdgcn.softwqm. Like @llvm.amdgcn.wcm it is
+ // turned into a copy by WQM pass, but does not seed WQM requirements.
+ def SOFT_WQM : PseudoInstSI <(outs unknown:$vdst), (ins unknown:$src0)>;
+
  // Pseudoinstruction for @llvm.amdgcn.wwm. It is turned into a copy post-RA, so
  // that the @earlyclobber is respected. The @earlyclobber is to make sure that
  // the instruction that defines $src0 (which is run in WWM) doesn't
  // accidentally clobber inactive channels of $vdst.
  let Constraints = "@earlyclobber $vdst" in {
  def WWM : PseudoInstSI <(outs unknown:$vdst), (ins unknown:$src0)>;
  }

[...]

lib/Target/AMDGPU/SIWholeQuadMode.cpp

[...]

  // Scan instructions to determine which ones require an Exact execmask and
  // which ones seed WQM requirements.
  char SIWholeQuadMode::scanInstructions(MachineFunction &MF,
                                         std::vector<WorkItem> &Worklist) {
    char GlobalFlags = 0;
    bool WQMOutputs = MF.getFunction().hasFnAttribute("amdgpu-ps-wqm-outputs");
    SmallVector<MachineInstr *, 4> SetInactiveInstrs;
+   SmallVector<MachineInstr *, 4> SoftWQMInstrs;

    // We need to visit the basic blocks in reverse post-order so that we visit
    // defs before uses, in particular so that we don't accidentally mark an
    // instruction as needing e.g. WQM before visiting it and realizing it needs
    // WQM disabled.
    ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
    for (auto BI = RPOT.begin(), BE = RPOT.end(); BI != BE; ++BI) {
      MachineBasicBlock &MBB = **BI;
[...] for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
          markInstructionUses(MI, StateWQM, Worklist);
          GlobalFlags |= StateWQM;
          continue;
        } else if (Opcode == AMDGPU::WQM) {
          // The WQM intrinsic requires its output to have all the helper lanes
          // correct, so we need it to be in WQM.
          Flags = StateWQM;
          LowerToCopyInstrs.push_back(&MI);
+       } else if (Opcode == AMDGPU::SOFT_WQM) {
+         LowerToCopyInstrs.push_back(&MI);
+         SoftWQMInstrs.push_back(&MI);
+         continue;
        } else if (Opcode == AMDGPU::WWM) {
          // The WWM intrinsic doesn't make the same guarantee, and plus it needs
          // to be executed in WQM or Exact so that its copy doesn't clobber
          // inactive lanes.
          markInstructionUses(MI, StateWWM, Worklist);
          GlobalFlags |= StateWWM;
          LowerToCopyInstrs.push_back(&MI);
          continue;
[...] for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
        markInstruction(MI, Flags, Worklist);
        GlobalFlags |= Flags;
      }
    }

    // Mark sure that any SET_INACTIVE instructions are computed in WQM if WQM is
    // ever used anywhere in the function. This implements the corresponding
    // semantics of @llvm.amdgcn.set.inactive.
+   // Similarly for SOFT_WQM instructions, implementing @llvm.amdgcn.softwqm.
    if (GlobalFlags & StateWQM) {
      for (MachineInstr *MI : SetInactiveInstrs)
        markInstruction(*MI, StateWQM, Worklist);
+     for (MachineInstr *MI : SoftWQMInstrs)
+       markInstruction(*MI, StateWQM, Worklist);
    }

    return GlobalFlags;
  }

  void SIWholeQuadMode::propagateInstruction(MachineInstr &MI,
                                             std::vector<WorkItem>& Worklist) {
    MachineBasicBlock *MBB = MI.getParent();
[...] bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {
    MRI = &MF.getRegInfo();
    LIS = &getAnalysis<LiveIntervals>();

    char GlobalFlags = analyzeFunction(MF);
    unsigned LiveMaskReg = 0;
    unsigned Exec = ST->isWave32() ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
    if (!(GlobalFlags & StateWQM)) {
      lowerLiveMaskQueries(Exec);
-     if (!(GlobalFlags & StateWWM))
+     if (!(GlobalFlags & StateWWM) && LowerToCopyInstrs.empty())
        return !LiveMaskQueries.empty();
    } else {
      // Store a copy of the original live mask when required
      MachineBasicBlock &Entry = MF.front();
      MachineBasicBlock::iterator EntryMI = Entry.getFirstNonPHI();

      if (GlobalFlags & StateExact || !LiveMaskQueries.empty()) {
        LiveMaskReg = MRI->createVirtualRegister(TRI->getBoolRC());
[...]
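
The changed early return in runOnMachineFunction can be read as follows. A shader whose only WQM-related instruction is SOFT_WQM sets neither the WQM nor the WWM flag, yet the scan has already queued its pseudo in LowerToCopyInstrs, so the function presumably must not return before that list is processed. The sketch below models only that condition. `ScanSummary` and both helper functions are hypothetical names for illustration and are not part of the pass.

```cpp
#include <cstddef>

// Hypothetical summary of what the scan found; these field names are
// illustrative and do not exist in SIWholeQuadMode.
struct ScanSummary {
  bool SawWQM;                    // some instruction seeded a WQM requirement
  bool SawWWM;                    // some instruction seeded a WWM requirement
  std::size_t PendingCopyPseudos; // pseudos queued to be lowered to copies
};

// Early-return condition as it read before the change.
bool canReturnEarlyBefore(const ScanSummary &S) {
  return !S.SawWQM && !S.SawWWM;
}

// Early-return condition after the change: additionally require that no
// pseudo is still waiting to be lowered to a copy.
bool canReturnEarlyAfter(const ScanSummary &S) {
  return !S.SawWQM && !S.SawWWM && S.PendingCopyPseudos == 0;
}
```

For a summary of `{false, false, 1}`, which corresponds to a shader with a lone softwqm call, the old condition allows the early return and the new one does not. The first test in the file below (test1) exercises a shader of this shape.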

test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=CHECK %s

				; Check that WQM is not triggered by the softwqm intrinsic alone.
				;
				;CHECK-LABEL: {{^}}test1:
				;CHECK-NOT: s_wqm_b64 exec, exec
				;CHECK: buffer_load_dword
				;CHECK: buffer_load_dword
				;CHECK: v_add_f32_e32
				define amdgpu_ps float @test1(i32 inreg %idx0, i32 inreg %idx1) {
				main_body:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%out = fadd float %src0, %src1
				%out.0 = call float @llvm.amdgcn.softwqm.f32(float %out)
				ret float %out.0
				}

				; Check that the softwqm intrinsic works correctly for integers.
				;
				;CHECK-LABEL: {{^}}test2:
				;CHECK-NOT: s_wqm_b64 exec, exec
				;CHECK: buffer_load_dword
				;CHECK: buffer_load_dword
				;CHECK: v_add_f32_e32
				define amdgpu_ps float @test2(i32 inreg %idx0, i32 inreg %idx1) {
				main_body:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%out = fadd float %src0, %src1
				%out.0 = bitcast float %out to i32
				%out.1 = call i32 @llvm.amdgcn.softwqm.i32(i32 %out.0)
				%out.2 = bitcast i32 %out.1 to float
				ret float %out.2
				}

				; Make sure the transition from WQM to Exact to softwqm does not trigger WQM.
				;
				;CHECK-LABEL: {{^}}test_softwqm1:
				;CHECK-NOT: s_wqm_b64 exec, exec
				;CHECK: buffer_load_dword
				;CHECK: buffer_load_dword
				;CHECK: buffer_store_dword
				;CHECK-NOT; s_wqm_b64 exec, exec
				;CHECK: v_add_f32_e32
				define amdgpu_ps float @test_softwqm1(i32 inreg %idx0, i32 inreg %idx1) {
				main_body:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%temp = fadd float %src0, %src1
				call void @llvm.amdgcn.buffer.store.f32(float %temp, <4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%out = fadd float %temp, %temp
				%out.0 = call float @llvm.amdgcn.softwqm.f32(float %out)
				ret float %out.0
				}

				; Make sure the transition from WQM to Exact to softwqm does trigger WQM.
				;
				;CHECK-LABEL: {{^}}test_softwqm2:
				;CHECK: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
				;CHECK: s_wqm_b64 exec, exec
				;CHECK: buffer_load_dword
				;CHECK: buffer_load_dword
				;CHECK: s_and_b64 exec, exec, [[ORIG]]
				;CHECK: buffer_store_dword
				;CHECK; s_wqm_b64 exec, exec
				;CHECK: v_add_f32_e32
				define amdgpu_ps float @test_softwqm2(i32 inreg %idx0, i32 inreg %idx1) {
				main_body:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%temp = fadd float %src0, %src1
				%temp.0 = call float @llvm.amdgcn.wqm.f32(float %temp)
				call void @llvm.amdgcn.buffer.store.f32(float %temp.0, <4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%out = fadd float %temp, %temp
				%out.0 = call float @llvm.amdgcn.softwqm.f32(float %out)
				ret float %out.0
				}

				; Make sure the transition from Exact to WWM then softwqm does not trigger WQM.
				;
				;CHECK-LABEL: {{^}}test_wwm1:
				;CHECK: buffer_load_dword
				;CHECK: buffer_store_dword
				;CHECK: s_or_saveexec_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], -1
				;CHECK: buffer_load_dword
				;CHECK: v_add_f32_e32
				;CHECK: s_mov_b64 exec, [[ORIG]]
				;CHECK-NOT: s_wqm_b64
				define amdgpu_ps float @test_wwm1(i32 inreg %idx0, i32 inreg %idx1) {
				main_body:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %src0, <4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%temp = fadd float %src0, %src1
				%temp.0 = call float @llvm.amdgcn.wwm.f32(float %temp)
				%out = fadd float %temp.0, %temp.0
				%out.0 = call float @llvm.amdgcn.softwqm.f32(float %out)
				ret float %out.0
				}

				; Check that softwqm on one case of branch does not trigger WQM for shader.
				;
				;CHECK-LABEL: {{^}}test_control_flow_0:
				;CHECK-NEXT: ; %main_body
				;CHECK-NOT: s_wqm_b64 exec, exec
				;CHECK: %ELSE
				;CHECK: store
				;CHECK: %IF
				;CHECK: buffer_load
				;CHECK: buffer_load
				define amdgpu_ps float @test_control_flow_0(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, i32 inreg %idx0, i32 inreg %idx1, i32 %c, i32 %z, float %data) {
				main_body:
				%cmp = icmp eq i32 %z, 0
				br i1 %cmp, label %IF, label %ELSE

				IF:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%out = fadd float %src0, %src1
				%data.if = call float @llvm.amdgcn.softwqm.f32(float %out)
				br label %END

				ELSE:
				call void @llvm.amdgcn.buffer.store.f32(float %data, <4 x i32> undef, i32 %c, i32 0, i1 0, i1 0)
				br label %END

				END:
				%r = phi float [ %data.if, %IF ], [ %data, %ELSE ]
				ret float %r
				}

				; Check that softwqm on one case of branch is treated as WQM in WQM shader.
				;
				;CHECK-LABEL: {{^}}test_control_flow_1:
				;CHECK-NEXT: ; %main_body
				;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
				;CHECK-NEXT: s_wqm_b64 exec, exec
				;CHECK: %ELSE
				;CHECK: s_and_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]], [[ORIG]]
				;CHECK: store
				;CHECK: s_mov_b64 exec, [[SAVED]]
				;CHECK: %IF
				;CHECK-NOT: s_and_saveexec_b64
				;CHECK-NOT: s_and_b64 exec
				;CHECK: buffer_load
				;CHECK: buffer_load
				define amdgpu_ps float @test_control_flow_1(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, i32 inreg %idx0, i32 inreg %idx1, i32 %c, i32 %z, float %data) {
				main_body:
				%c.bc = bitcast i32 %c to float
				%tex = call <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32 15, float %c.bc, <8 x i32> %rsrc, <4 x i32> %sampler, i1 0, i32 0, i32 0) #0
				%tex0 = extractelement <4 x float> %tex, i32 0
				%dtex = call <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32 15, float %tex0, <8 x i32> %rsrc, <4 x i32> %sampler, i1 0, i32 0, i32 0) #0
				%data.sample = extractelement <4 x float> %dtex, i32 0

				%cmp = icmp eq i32 %z, 0
				br i1 %cmp, label %IF, label %ELSE

				IF:
				%src0 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i1 0, i1 0)
				%src1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> undef, i32 %idx1, i32 0, i1 0, i1 0)
				%out = fadd float %src0, %src1
				%data.if = call float @llvm.amdgcn.softwqm.f32(float %out)
				br label %END

				ELSE:
				call void @llvm.amdgcn.buffer.store.f32(float %data.sample, <4 x i32> undef, i32 %c, i32 0, i1 0, i1 0)
				br label %END

				END:
				%r = phi float [ %data.if, %IF ], [ %data, %ELSE ]
				ret float %r
				}

				declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1) #2
				declare void @llvm.amdgcn.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i1, i1) #2
				declare float @llvm.amdgcn.buffer.load.f32(<4 x i32>, i32, i32, i1, i1) #3
				declare <4 x float> @llvm.amdgcn.image.sample.1d.v4f32.f32(i32, float, <8 x i32>, <4 x i32>, i1, i32, i32) #3
				declare <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #3
				declare void @llvm.amdgcn.kill(i1) #1
				declare float @llvm.amdgcn.wqm.f32(float) #3
				declare float @llvm.amdgcn.softwqm.f32(float) #3
				declare i32 @llvm.amdgcn.softwqm.i32(i32) #3
				declare float @llvm.amdgcn.wwm.f32(float) #3

				attributes #1 = { nounwind }
				attributes #2 = { nounwind readonly }
				attributes #3 = { nounwind readnone }