
AMDGPU: Add SIWholeQuadMode pass
ClosedPublic

Authored by nhaehnle on Mar 14 2016, 3:03 PM.

Details

Summary

Whole quad mode is already enabled for pixel shaders that compute
derivatives, but it must be suspended for instructions that cause a
shader to have side effects (i.e. stores and atomics).

This pass addresses the issue by storing the real (initial) live mask
in a register, masking EXEC before instructions that require exact
execution and (re-)enabling WQM where required.
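The lane-mask bookkeeping can be sketched as follows (a minimal Python model, purely illustrative — the pass itself of course manipulates EXEC with machine instructions such as s_wqm_b64 and s_and_b64):

```python
def wqm(mask, lanes=64):
    """Whole quad mode: each aligned group of 4 lanes (a 'quad') becomes
    all-live if any lane in the group is live. This models what the
    hardware s_wqm_b64 instruction computes."""
    out = 0
    for q in range(0, lanes, 4):
        if (mask >> q) & 0xF:
            out |= 0xF << q
    return out

# The pass saves the real (initial) live mask in a register; EXEC is
# then set to the WQM mask for derivative computations and restored to
# the saved mask before instructions that require exact execution.
live_mask = 0b00010110       # lanes 1, 2 and 4 live
exec_wqm = wqm(live_mask)    # helper lanes enabled for derivatives
exec_exact = live_mask       # restored before stores/atomics
```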

This pass is run before register coalescing so that we can use
machine SSA for analysis.

The changes in this patch expose a problem with the second machine
scheduling pass: target independent instructions like COPY implicitly
use EXEC when they operate on VGPRs, but this fact is not encoded in
the MIR. This can lead to miscompilation because instructions are
moved past changes to EXEC.

This patch fixes the problem by adding use-implicit operands to
target independent instructions. Some general codegen passes are
relaxed to work with such implicit use operands.

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle updated this revision to Diff 50655. Mar 14 2016, 3:03 PM
nhaehnle retitled this revision from to AMDGPU: Add SIWholeQuadMode pass.
nhaehnle updated this object.
nhaehnle added a subscriber: llvm-commits.
nhaehnle updated this revision to Diff 50671. Mar 14 2016, 4:09 PM

In the fast case where we have WQM but don't need to switch back and forth,
also mark COPYs as implicitly using EXEC in the entry block.

Noticed this in a more involved shader in a piglit run.

The target independent changes might get more visibility if they were in a separate patch.

The changes in this patch expose a problem with the second machine
scheduling pass: target independent instructions like COPY implicitly
use EXEC when they operate on VGPRs, but this fact is not encoded in
the MIR. This can lead to miscompilation because instructions are
moved past changes to EXEC.

Is the scheduler the only pass we need to worry about? Would we be able to avoid the problem by implementing TargetInstrInfo::isSchedulingBoundary()?

Hmm, I wasn't aware of that function. Yes, customizing that function so that it considers instructions with EXEC defs as scheduling boundaries should work as well.

It's a slightly more conservative solution because it also prevents movement of SALU and SMEM instructions. Then again, it should not impact the initial DAG-based scheduling. Also, I notice that the target-independent code mentions that considering stack pointer modifications to be scheduling boundaries is profitable, and the same logic likely applies to EXEC. Plus, it's a more robust solution.

I'll modify the patch to use isSchedulingBoundary instead.
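A toy model of that idea (hypothetical Python, not the LLVM C++ API — the real change overrides TargetInstrInfo::isSchedulingBoundary on MachineInstr): any instruction that defines EXEC splits the block into regions, and the scheduler may only reorder within a region, so nothing can move across a change to EXEC:

```python
def is_scheduling_boundary(inst):
    """Toy predicate: an instruction defining EXEC is a boundary."""
    return "EXEC" in inst.get("defs", ())

def schedule_regions(insts):
    """Split an instruction list into regions at boundaries; a scheduler
    may reorder instructions only within a single region."""
    regions, cur = [], []
    for inst in insts:
        if is_scheduling_boundary(inst):
            if cur:
                regions.append(cur)
            regions.append([inst])  # the boundary itself stays fixed
            cur = []
        else:
            cur.append(inst)
    if cur:
        regions.append(cur)
    return regions
```

With this model, an s_wqm_b64 (which defines EXEC) pins the surrounding SALU/SMEM and VALU instructions on their respective sides, which is the slightly more conservative behavior discussed above.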

nhaehnle updated this revision to Diff 50943. Mar 17 2016, 9:16 AM
nhaehnle edited edge metadata.

[This time with the correct --update parameter for arc]

Use isSchedulingBoundary instead of implicit-use of EXEC, which gets rid of
the target-independent modifications.

This is indeed more conservative, as you can tell from the change in
si-scheduler.ll: previously, the later scheduling passes managed to move the
initial s_wqm_b64 after the s_load_dwordx4 and s_load_dwordx8, i.e. we lose
slightly in latency hiding.

The impact should be small, and it does make sense to land a more conservative
and robust patch initially.

lib/Target/AMDGPU/AMDGPUInstrInfo.cpp
67 ↗(On Diff #50943)

This should go in SIInstrInfo, since it is SI specific.

70–72 ↗(On Diff #50943)

Can we just call TargetInstrInfo::isSchedulingBoundary() at the end of the function rather than have this check?

nhaehnle updated this revision to Diff 51029. Mar 18 2016, 9:06 AM

Makes sense, here's an updated patch.

tstellarAMD accepted this revision. Mar 18 2016, 3:44 PM
tstellarAMD edited edge metadata.

LGTM, just a few things to fix before you commit.

lib/Target/AMDGPU/AMDGPUInstrInfo.h
65 ↗(On Diff #51029)

Extra whitespace change.

lib/Target/AMDGPU/SIWholeQuadMode.cpp
311 ↗(On Diff #51029)

Coding Style: Brace on new line.

327 ↗(On Diff #51029)

Coding Style: Brace on new line.

This revision is now accepted and ready to land. Mar 18 2016, 3:44 PM
arsenm edited edge metadata. Mar 18 2016, 4:17 PM

Should wqm-related instructions be marked as convergent?

Thanks for the feedback Tom. I'll work in those changes before I commit.

Matt, I'm not sure "convergent" really captures the properties. In practice, I doubt it's a problem because the relevant instructions define EXEC - as far as I can tell, this means they're left alone by the passes that care about convergence, because EXEC is a physical register.

It's more likely to be a problem for the corresponding IR intrinsics

Ah okay. There are two intrinsic types related to all this:

  1. The kill intrinsic. This is marked as having side effects, which I think should imply convergent.
  2. The load / store / atomic intrinsics. They themselves don't involve derivatives. If their results are used in a derivative computation, then it is sufficient to ensure that they are always executed when the consumer is executed, but that automatically follows from plain data flow (just like any other normal computations).

So I think we're fine without convergent attributes anywhere.

nhaehnle added a comment. Edited Mar 19 2016, 7:29 AM

As an addendum: In part, correctness depends on the guarantees for derivatives that are required by GLSL. For example, if you have:

%tmp = call <2 x i32> @llvm.amdgcn.image.load.v2i32(...)
%coords = bitcast and extract from %tmp
...
br i1 %cc, label %IF, label %ELSE

IF:
%texel = call <4 x float> @llvm.SI.image.sample.v2i32(<2 x i32> %coords, ...)
...

ELSE:
... %coord not used here or later ...

The derivative taken by the llvm.SI.image.sample is undefined in GLSL if the control-flow is dynamically non-uniform, so it is perfectly legal to sink the llvm.amdgcn.image.load into the IF block (and the same applies to any other computation that leads to a derivative).

Ah okay. There are two intrinsic (types) related to all this:

  1. The kill intrinsic. This is marked as having side effects, which I think should imply convergent.

Side effects do not imply convergent

  2. The load / store / atomic intrinsics. They themselves don't involve derivatives. If their results are used in a derivative computation, then it is sufficient to ensure that they are always executed when the consumer is executed, but that automatically follows from plain data flow (just like any other normal computations).

It shouldn't be necessary for these

The derivative taken by the llvm.SI.image.sample is undefined in GLSL if the control-flow is dynamically non-uniform, so it is perfectly legal to sink the llvm.amdgcn.image.load into the IF block (and the same applies to any other computation that leads to a derivative).

This is the same situation as barriers, so I think this should still be convergent. The problem it solves is LLVM introducing uses that do not sit in uniform control flow, such as introducing a call into either side of an if/then block.

For kill: Okay, side effects do not imply convergent, but the kill can only be moved in a way that preserves its execution. So IR-level optimizations are allowed to move a kill into both the if and the else branch of a subsequent if-else block, but this still results in correct code: both branches may mask away some bits of EXEC, but since the control flow is joined again with a bit-wise OR, that's okay.
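That argument can be checked with a small lane-mask model (illustrative Python with 8 lanes; the names are made up): duplicating the kill into both sides and rejoining EXEC with a bitwise OR yields the same mask as executing the kill once before the branch:

```python
def kill(exec_mask, cond_mask):
    # kill disables the lanes where the kill condition holds
    return exec_mask & ~cond_mask

FULL = 0xFF             # 8 live lanes, for illustration
cc = 0b00001111         # lanes that take the IF side
kill_cond = 0b01010101  # lanes the kill discards

# Kill executed once, before the if/else:
before_branch = kill(FULL, kill_cond)

# Kill duplicated into both branches; control flow rejoins by OR-ing
# the two partial EXEC masks back together:
if_exec = kill(FULL & cc, kill_cond)
else_exec = kill(FULL & ~cc, kill_cond)
rejoined = if_exec | else_exec
```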

The derivative taken by the llvm.SI.image.sample is undefined in GLSL if the control-flow is dynamically non-uniform, so it is perfectly legal to sink the llvm.amdgcn.image.load into the IF block (and the same applies to any other computation that leads to a derivative).

This is the same situation as barriers, so I think this should still be convergent. The problem it solves is LLVM introducing uses that do not sit in uniform control flow, such as introducing a call into either side of an if/then block.

I do not think this is the same situation as barriers. Barriers need to be convergent because all threads have to execute the same barrier instruction (at the same program counter) for the semantics to remain correct.

If (plain) loads are pushed down into either side of an if/else block, that's inefficient but correct (of course assuming no stores in between etc.). There is no synchronization or data exchange between threads, so all that matters is that the right value is loaded into the right register on a per-thread basis; which instructions do the job at which program counter value is not important.
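A per-lane sketch of why that holds (illustrative Python, invented names): executing the load once before the branch, or duplicating it into both sides, leaves every lane with the same final value, because no cross-lane communication is involved:

```python
mem = {0: 10, 1: 20, 2: 30, 3: 40}  # per-lane memory contents

def run(lanes, if_lanes, sink_load):
    """Simulate each lane independently; sink_load chooses whether the
    load happens before the branch or inside each branch."""
    regs = {}
    for lane in lanes:
        if not sink_load:
            regs[lane] = mem[lane]      # load before the if/else
        if lane in if_lanes:
            if sink_load:
                regs[lane] = mem[lane]  # load duplicated into IF
            regs[lane] += 1             # IF-side work
        else:
            if sink_load:
                regs[lane] = mem[lane]  # load duplicated into ELSE
            regs[lane] += 2             # ELSE-side work
    return regs
```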

If I am still misunderstanding you, perhaps an example would be helpful.

This revision was automatically updated to reflect the committed changes.