This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Reworked SIFixWWMLiveness
ClosedPublic

Authored by tpr on May 11 2018, 7:13 AM.

Details

Reviewers

cwabbott
nhaehnle
mareko

Group Reviewers

Restricted Project

Commits

rGabd85fb1f57f: [AMDGPU] Reworked SIFixWWMLiveness
rL338783: [AMDGPU] Reworked SIFixWWMLiveness

Summary

I encountered some problems with SIFixWWMLiveness when WWM is in a loop:

1. It sometimes gave invalid MIR where there is some control flow path to the new implicit use of a register on EXIT_WWM that does not pass through any def.

2. There were lots of false positives of registers that needed to have an implicit use added to EXIT_WWM.

3. Adding an implicit use to EXIT_WWM (and adding an implicit def just before the WWM code, which I tried in order to fix (1)) caused lots of the values to be spilled and reloaded unnecessarily.

This commit is a rework of SIFixWWMLiveness, with the following changes:

1. Instead of considering any register with a def that can reach the WWM code and a def that can be reached from the WWM code, it now considers three specific cases that need to be handled.

2. A register that needs liveness over WWM to be synthesized now has it done by adding itself as an implicit use to defs other than the dominant one.
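
For readers new to the underlying hazard, the following is a minimal sketch in plain C++ (no LLVM dependencies) of why liveness has to be extended across WWM code. It replays the if..endif snippet from the pass's header comment on a toy 4-lane register, assuming the register allocator has placed the outer value and the temporary in the same physical register. The names `PhysReg`, `writeLanes` and `runWithSharedRegister`, the lane count, and the value 42.0 are all hypothetical and chosen only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 4-lane model of one physical VGPR; real wavefronts are wider.
constexpr int kNumLanes = 4;
constexpr std::uint32_t kAllLanes = (1u << kNumLanes) - 1;
using PhysReg = std::array<float, kNumLanes>;

// Write `value` into every lane whose bit is set in `execMask`.
void writeLanes(PhysReg &reg, std::uint32_t execMask, float value) {
  for (int lane = 0; lane < kNumLanes; ++lane)
    if (execMask & (1u << lane))
      reg[lane] = value;
}

// Replays the if..endif snippet with the outer value and the temporary
// allocated to the same physical register. `thenMask` is the set of lanes
// that take the "then" branch; `tempInWWM` selects whether the temporary is
// written with all lanes enabled (WWM) or only with the active lanes.
PhysReg runWithSharedRegister(std::uint32_t thenMask, bool tempInWWM) {
  PhysReg shared{};
  writeLanes(shared, kAllLanes, 0.0f);                          // outer = 0.0
  writeLanes(shared, tempInWWM ? kAllLanes : thenMask, 42.0f);  // temp = ...
  writeLanes(shared, thenMask, 1.0f);                           // outer = 1.0
  return shared;                                                // ... = outer
}
```

With `thenMask = 0x3`, the non-WWM run leaves lanes 2 and 3 at 0.0 as intended, while the WWM run leaves the temporary's value in those lanes. Making the second def a partial def (an implicit use of the same register) is what stops the allocator from sharing the register in the first place.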

Also added the following fixmes:

FIXME: We should detect whether a register in one of the above
categories is already live at the WWM code before deciding to add the
implicit uses to synthesize its liveness.

FIXME: I believe this whole scheme may be flawed due to the possibility
of the register allocator doing live interval splitting.

Change-Id: Ie7fba0ede0378849181df3f1a9a7a39ed1a94a94

Diff Detail

Repository: rL LLVM

Event Timeline

tpr created this revision. May 11 2018, 7:13 AM

Herald added subscribers: llvm-commits, t-tye, dstuttard and 5 others. May 11 2018, 7:13 AM

tpr added reviewers: cwabbott, nhaehnle. May 11 2018, 7:16 AM

tpr added reviewers: mareko, Restricted Project. May 17 2018, 12:51 PM

arsenm added inline comments. May 18 2018, 1:31 AM

lib/Target/AMDGPU/SIFixWWMLiveness.cpp
238 ↗	(On Diff #146321)	Without looking at this too closely, I would like to avoid adding assumptions where the number of successors of a block is exactly 2 to avoid future pain when control flow lowering is changed. Is there a more specific property you can check instead?

tpr added inline comments. May 23 2018, 7:36 AM

lib/Target/AMDGPU/SIFixWWMLiveness.cpp
238 ↗	(On Diff #146321)	It is checking for a very specific case. Currently I believe the structurizer will always generate if..then..endif as the only way of having a non-uniform phi in non-loop cases. This seems the best way to do it for now, although the whole thing is a hack that would possibly need core LLVM support to fix properly.

Ping @arsenm

Is this ok?

Sorry I didn't get to this earlier, but would you mind holding off on this a little bit? I'd like to think this through.

@cwabbott Did you have a chance to look at this?

OK will hold off for a bit.

I've had some time to let this sink in now.

Let me start by saying that I'm very skeptical of any approach that relies on LoopInfo or RegionInfo for correctness, and case distinctions in general.

That said, I do like the idea of getting rid of implicit uses on WWM, and instead turning defs of variables into partial defs. That fundamentally makes sense.

I also think your second FIXME is spot-on: if the register allocator does any kind of splitting / spilling inside WWM code, things are likely to get screwed up. I couldn't think of a way to fix that without going into the guts of the register allocator, so the following just ignores that problem entirely for now.

How about the following alternative logic for where partial defs are needed. Move this pass to before PHI elimination (but still after WQM), and consider all PHIs of vector registers. For every operand X of the PHI node consider its unique def D. If any of the defs of the other operands of the PHI node can reach[1] D via a WWM region, then add an appropriate implicit use to D. In the general case, this may require creating new PHI nodes to preserve SSA form, so perhaps the MachineSSAUpdater can be used for some of this (this would also help with adding IMPLICIT_DEFs where needed).

With this approach, we should be able to feel much more confident about the correctness of it all (except for the issue with register spills), and having fewer steps between PHI elimination and register allocation is always a good thing. It also preserves the nice property of your approach that we don't add excessive implicit uses.

If a normal graph walk is used to determine reachability in [1], then this is a somewhat conservative approximation in some uniform control flow cases, but I hope we can accept this as a first step. I haven't given much thought yet to how we could do better in general.
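
The "can reach D via a WWM region" test with a plain graph walk could be sketched roughly as follows. This is not code from the patch or from LLVM: it is a standalone illustration over a hypothetical block-level CFG (`Cfg`, `hasWWM` and `reachesViaWWM` are made-up names), and it works at basic-block granularity, so it is more conservative than an instruction-level walk would be.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical block-level CFG: succs[b] lists the successors of block b.
using Cfg = std::vector<std::vector<int>>;

// Returns true if some path with at least one edge leads from block `from`
// to block `to` and touches a block that contains WWM code. The endpoint
// blocks count as well, so this over-approximates an instruction-level walk.
bool reachesViaWWM(const Cfg &succs, const std::vector<bool> &hasWWM,
                   int from, int to) {
  // visited[b][w]: block b was reached with (w == 1) or without (w == 0)
  // having passed through WWM code on the way. Value-initialized to false.
  std::vector<std::array<bool, 2>> visited(succs.size());
  std::queue<std::pair<int, bool>> work;

  auto visit = [&](int block, bool seenWWM) {
    seenWWM = seenWWM || hasWWM[block];
    const int slot = seenWWM ? 1 : 0;
    if (!visited[block][slot]) {
      visited[block][slot] = true;
      work.push(std::make_pair(block, seenWWM));
    }
  };

  for (int succ : succs[from])
    visit(succ, hasWWM[from]);

  while (!work.empty()) {
    const std::pair<int, bool> state = work.front();
    work.pop();
    if (state.first == to && state.second)
      return true;
    for (int succ : succs[state.first])
      visit(succ, state.second);
  }
  return false;
}
```

For an if..endif shape with blocks 0 (if), 1 (then, containing WWM) and 2 (endif), the def in block 0 reaches the def in block 1 via WWM, which is the situation where the def in block 1 would receive the implicit use.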

Thanks, sounds feasible. I'll give it a go.

Actually that plan does not work for one of the cases handled in the current change: when there is a def towards the end of a loop and a use outside the loop, and there is some WWM inside the loop. There is no phi node at the top of the loop at all for that, but we need to spot the case and introduce one (and make its value undef from the loop pre-header) because the liveness needs to go round the loop to allow for lanes that were in the loop but have become inactive because they have already decided to leave the loop.

Ping @nhaehnle -- any comment on the flaw in your plan? Can we just go with this change as it is better than what is there now?

First, thanks for working on this issue. It has been on our table for ages and I haven't found a proper solution yet.
I tested your patch with my shader_ballot implementation on DOOM, but still get a bunch of artifacts (https://github.com/daniel-schuermann/mesa/commits/shader_ballot).
It might also be that my implementation is buggy, so if you have an AMDVLK branch with working

OpExtension "SPV_AMD_gcn_shader"
OpExtension "SPV_AMD_shader_ballot"
OpExtension "SPV_AMD_shader_trinary_minmax"
OpExtension "SPV_KHR_shader_ballot"
OpExtension "SPV_KHR_subgroup_vote"
OpCapability SubgroupBallotKHR
OpCapability SubgroupVoteKHR
OpCapability Group

I'd be glad if you could let me know whether DOOM works for you.

In D46756#1138135, @tpr wrote:

Actually that plan does not work for one of the cases handled in the current change: when there is a def towards the end of a loop and a use outside the loop, and there is some WWM inside the loop. There is no phi node at the top of the loop at all for that, but we need to spot the case and introduce one (and make its value undef from the loop pre-header) because the liveness needs to go round the loop to allow for lanes that were in the loop but have become inactive because they have already decided to leave the loop.

According to literature[1], there are exactly three types of phi-nodes:
γ - functions represent the joining point of different paths created by an “if-then-else” branch in the source program.
μ - functions, which only exist at loop headers, merge initial and loop-carried values.
η - functions represent values that leave a loop.

The case you are talking about requires an η phi-node after the loop exit, which should be there if we are in LCSSA-form. (If not, we'd have to lower to LCSSA.)
I think case distinction makes sense here, as these are the only variations of phi-nodes. For the η-node, we'd have to check whether a def of a phi source can reach itself (as it's inside a loop) via a WWM region to resolve this issue.
The same might be true for the μ phi-nodes (e.g. if we have only one initial and one loop-carried value and afterwards a WWM region inside the loop, then neither the initial value reaches the loop-carried value via WWM nor the other way around).
The other cases can be handled according to @nhaehnle's proposal.

[1] Tu, Peng, and David Padua. "Efficient building and placing of gating functions." ACM SIGPLAN Notices 30.6 (1995): 47-55.

Hi Daniel

Thanks for the comments. The LCSSA thing sounds good, except it does not overcome Nicolai's objection to relying on LoopInfo for semantic correctness.

I guess it might be possible to find the loop-exiting values as a def that reaches itself and has at least one use that does not reach the def.
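
As a rough illustration of that idea, the check could look something like the sketch below. It is not part of the patch: it uses a hypothetical block-level CFG with made-up names (`Cfg`, `reaches`, `isLoopExitingValue`) rather than LLVM's MachineFunction, and it treats defs and uses at basic-block granularity, which is coarser than a real implementation would need.

```cpp
#include <cassert>
#include <vector>

// Hypothetical block-level CFG: succs[b] lists the successors of block b.
using Cfg = std::vector<std::vector<int>>;

// True if block `to` can be reached from block `from` along one or more edges.
bool reaches(const Cfg &succs, int from, int to) {
  std::vector<bool> visited(succs.size(), false);
  std::vector<int> stack(succs[from].begin(), succs[from].end());
  while (!stack.empty()) {
    const int block = stack.back();
    stack.pop_back();
    if (block == to)
      return true;
    if (visited[block])
      continue;
    visited[block] = true;
    for (int succ : succs[block])
      stack.push_back(succ);
  }
  return false;
}

// A def is treated as loop-exiting if it reaches itself (so it sits on a
// cycle) and at least one of its uses cannot get back to the def (so that
// use lies outside the cycle).
bool isLoopExitingValue(const Cfg &succs, int defBlock,
                        const std::vector<int> &useBlocks) {
  if (!reaches(succs, defBlock, defBlock))
    return false;
  for (int useBlock : useBlocks)
    if (!reaches(succs, useBlock, defBlock))
      return true;
  return false;
}
```

The appeal of this formulation is that it needs only CFG reachability and no LoopInfo query.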

Remember that all of this still has the fatal flaw that the register allocator may split a register and completely negate our attempts to add artificial liveness.

Agreed that LCSSA doesn't help either.

I think we have a consensus that the status quo is not ideal even with this patch (and even if my earlier suggestion could be made to work), but we also haven't made progress on how to fix this whole complex of issues properly. In the meantime, this patch does fix some bugs, so let's give it a shot.

This revision is now accepted and ready to land. Jul 31 2018, 5:34 AM

Closed by commit rL338783: [AMDGPU] Reworked SIFixWWMLiveness (authored by tpr). Aug 2 2018, 4:32 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/SIFixWWMLiveness.cpp	387 lines
llvm/trunk/test/CodeGen/AMDGPU/fix-wwm-liveness.mir	116 lines
llvm/trunk/test/CodeGen/AMDGPU/wqm.ll	39 lines

Diff 158859

llvm/trunk/lib/Target/AMDGPU/SIFixWWMLiveness.cpp

	//===-- SIFixWWMLiveness.cpp - Fix WWM live intervals ---------===//			//===-- SIFixWWMLiveness.cpp - Fix WWM live intervals ---------===//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	/// \file			/// \file
	/// Computations in WWM can overwrite values in inactive channels for			/// Computations in WWM can overwrite values in inactive channels for
	/// variables that the register allocator thinks are dead. This pass adds fake			/// variables that the register allocator thinks are dead. This pass adds fake
	/// uses of those variables to WWM instructions to make sure that they aren't			/// uses of those variables to their def(s) to make sure that they aren't
	/// overwritten.			/// overwritten.
	///			///
	/// As an example, consider this snippet:			/// As an example, consider this snippet:
	/// %vgpr0 = V_MOV_B32_e32 0.0			/// %vgpr0 = V_MOV_B32_e32 0.0
	/// if (...) {			/// if (...) {
	/// %vgpr1 = ...			/// %vgpr1 = ...
	/// %vgpr2 = WWM killed %vgpr1			/// %vgpr2 = WWM killed %vgpr1
	/// ... = killed %vgpr2			/// ... = killed %vgpr2
	/// %vgpr0 = V_MOV_B32_e32 1.0			/// %vgpr0 = V_MOV_B32_e32 1.0
	/// }			/// }
	/// ... = %vgpr0			/// ... = %vgpr0
	///			///
	/// The live intervals of %vgpr0 don't overlap with those of %vgpr1. Normally,			/// The live intervals of %vgpr0 don't overlap with those of %vgpr1. Normally,
	/// we can safely allocate %vgpr0 and %vgpr1 in the same register, since			/// we can safely allocate %vgpr0 and %vgpr1 in the same register, since
	/// writing %vgpr1 would only write to channels that would be clobbered by the			/// writing %vgpr1 would only write to channels that would be clobbered by the
	/// second write to %vgpr0 anyways. But if %vgpr1 is written with WWM enabled,			/// second write to %vgpr0 anyways. But if %vgpr1 is written with WWM enabled,
	/// it would clobber even the inactive channels for which the if-condition is			/// it would clobber even the inactive channels for which the if-condition is
	/// false, for which %vgpr0 is supposed to be 0. This pass adds an implicit use			/// false, for which %vgpr0 is supposed to be 0. This pass adds an implicit use
	/// of %vgpr0 to the WWM instruction to make sure they aren't allocated to the			/// of %vgpr0 to its def to make sure they aren't allocated to the
	/// same register.			/// same register.
	///			///
	/// In general, we need to figure out what registers might have their inactive			/// In general, we need to figure out what registers might have their inactive
	/// channels which are eventually used accidentally clobbered by a WWM			/// channels which are eventually used accidentally clobbered by a WWM
	/// instruction. We approximate this using two conditions:			/// instruction. We do that by spotting three separate cases of registers:
	///			///
	/// 1. A definition of the variable reaches the WWM instruction.			/// 1. A "then phi": the value resulting from phi elimination of a phi node at
	/// 2. The variable would be live at the WWM instruction if all its defs were			/// the end of an if..endif. If there is WWM code in the "then", then we
	/// partial defs (i.e. considered as a use), ignoring normal uses.			/// make the def at the end of the "then" branch a partial def by adding an
	///			/// implicit use of the register.
	/// If a register matches both conditions, then we add an implicit use of it to			///
	/// the WWM instruction. Condition #2 is the heart of the matter: every			/// 2. A "loop exit register": a value written inside a loop but used outside the
	/// definition is really a partial definition, since every VALU instruction is			/// loop, where there is WWM code inside the loop (the case in the example
	/// implicitly predicated. We can usually ignore this, but WWM forces us not			/// above). We add an implicit_def of the register in the loop pre-header,
	/// to. Condition #1 prevents false positives if the variable is undefined at			/// and make the original def a partial def by adding an implicit use of the
	/// the WWM instruction anyways. This is overly conservative in certain cases,			/// register.
	/// especially in uniform control flow, but this is a workaround anyways until			///
	/// LLVM gains the notion of predicated uses and definitions of variables.			/// 3. A "loop exit phi": the value resulting from phi elimination of a phi node
				/// in a loop header. If there is WWM code inside the loop, then we make all
				/// defs inside the loop partial defs by adding an implicit use of the
				/// register on each one.
				///
				/// Note that we do not need to consider an if..else..endif phi. We only need to
				/// consider non-uniform control flow, and control flow structurization would
				/// have transformed a non-uniform if..else..endif into two if..endifs.
				///
				/// The analysis to detect these cases relies on a property of the MIR
				/// arising from this pass running straight after PHIElimination and before any
				/// coalescing: that any virtual register with more than one definition must be
				/// the new register added to lower a phi node by PHIElimination.
				///
				/// FIXME: We should detect whether a register in one of the above categories is
				/// already live at the WWM code before deciding to add the implicit uses to
				/// synthesize its liveness.
				///
				/// FIXME: I believe this whole scheme may be flawed due to the possibility of
				/// the register allocator doing live interval splitting.
	///			///
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "AMDGPU.h"			#include "AMDGPU.h"
	#include "AMDGPUSubtarget.h"			#include "AMDGPUSubtarget.h"
	#include "SIInstrInfo.h"			#include "SIInstrInfo.h"
	#include "SIRegisterInfo.h"			#include "SIRegisterInfo.h"
	#include "MCTargetDesc/AMDGPUMCTargetDesc.h"			#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
	#include "llvm/ADT/DepthFirstIterator.h"			#include "llvm/ADT/DepthFirstIterator.h"
	#include "llvm/ADT/SparseBitVector.h"			#include "llvm/ADT/SparseBitVector.h"
	#include "llvm/CodeGen/LiveIntervals.h"			#include "llvm/CodeGen/LiveIntervals.h"
				#include "llvm/CodeGen/MachineDominators.h"
	#include "llvm/CodeGen/MachineFunctionPass.h"			#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineLoopInfo.h"
	#include "llvm/CodeGen/Passes.h"			#include "llvm/CodeGen/Passes.h"
	#include "llvm/CodeGen/TargetRegisterInfo.h"			#include "llvm/CodeGen/TargetRegisterInfo.h"

	using namespace llvm;			using namespace llvm;

	#define DEBUG_TYPE "si-fix-wwm-liveness"			#define DEBUG_TYPE "si-fix-wwm-liveness"

	namespace {			namespace {

	class SIFixWWMLiveness : public MachineFunctionPass {			class SIFixWWMLiveness : public MachineFunctionPass {
	private:			private:
				MachineDominatorTree *DomTree;
				MachineLoopInfo *LoopInfo;
	LiveIntervals *LIS = nullptr;			LiveIntervals *LIS = nullptr;
				const SIInstrInfo *TII;
	const SIRegisterInfo *TRI;			const SIRegisterInfo *TRI;
	MachineRegisterInfo *MRI;			MachineRegisterInfo *MRI;

				std::vector<MachineInstr *> WWMs;
				std::vector<MachineOperand *> ThenDefs;
				std::vector<std::pair<MachineOperand *, MachineLoop *>> LoopExitDefs;
				std::vector<std::pair<MachineOperand *, MachineLoop *>> LoopPhiDefs;

	public:			public:
	static char ID;			static char ID;

	SIFixWWMLiveness() : MachineFunctionPass(ID) {			SIFixWWMLiveness() : MachineFunctionPass(ID) {
	initializeSIFixWWMLivenessPass(*PassRegistry::getPassRegistry());			initializeSIFixWWMLivenessPass(*PassRegistry::getPassRegistry());
	}			}

	bool runOnMachineFunction(MachineFunction &MF) override;			bool runOnMachineFunction(MachineFunction &MF) override;

	bool runOnWWMInstruction(MachineInstr &MI);

	void addDefs(const MachineInstr &MI, SparseBitVector<> &set);

	StringRef getPassName() const override { return "SI Fix WWM Liveness"; }			StringRef getPassName() const override { return "SI Fix WWM Liveness"; }

	void getAnalysisUsage(AnalysisUsage &AU) const override {			void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequiredID(MachineDominatorsID);
				AU.addRequiredID(MachineLoopInfoID);
	// Should preserve the same set that TwoAddressInstructions does.			// Should preserve the same set that TwoAddressInstructions does.
	AU.addPreserved<SlotIndexes>();			AU.addPreserved<SlotIndexes>();
	AU.addPreserved<LiveIntervals>();			AU.addPreserved<LiveIntervals>();
	AU.addPreservedID(LiveVariablesID);			AU.addPreservedID(LiveVariablesID);
	AU.addPreservedID(MachineLoopInfoID);			AU.addPreservedID(MachineLoopInfoID);
	AU.addPreservedID(MachineDominatorsID);			AU.addPreservedID(MachineDominatorsID);
	AU.setPreservesCFG();			AU.setPreservesCFG();
	MachineFunctionPass::getAnalysisUsage(AU);			MachineFunctionPass::getAnalysisUsage(AU);
	}			}

				private:
				void processDef(MachineOperand &DefOpnd);
				bool processThenDef(MachineOperand *DefOpnd);
				bool processLoopExitDef(MachineOperand *DefOpnd, MachineLoop *Loop);
				bool processLoopPhiDef(MachineOperand *DefOpnd, MachineLoop *Loop);
	};			};

	} // End anonymous namespace.			} // End anonymous namespace.

	INITIALIZE_PASS(SIFixWWMLiveness, DEBUG_TYPE,			INITIALIZE_PASS_BEGIN(SIFixWWMLiveness, DEBUG_TYPE,
				"SI fix WWM liveness", false, false)
				INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
				INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)
				INITIALIZE_PASS_END(SIFixWWMLiveness, DEBUG_TYPE,
	"SI fix WWM liveness", false, false)			"SI fix WWM liveness", false, false)

	char SIFixWWMLiveness::ID = 0;			char SIFixWWMLiveness::ID = 0;

	char &llvm::SIFixWWMLivenessID = SIFixWWMLiveness::ID;			char &llvm::SIFixWWMLivenessID = SIFixWWMLiveness::ID;

	FunctionPass *llvm::createSIFixWWMLivenessPass() {			FunctionPass *llvm::createSIFixWWMLivenessPass() {
	return new SIFixWWMLiveness();			return new SIFixWWMLiveness();
	}			}

	void SIFixWWMLiveness::addDefs(const MachineInstr &MI, SparseBitVector<> &Regs)
	{
	for (const MachineOperand &Op : MI.defs()) {
	if (Op.isReg()) {
	unsigned Reg = Op.getReg();
	if (TRI->isVGPR(*MRI, Reg))
	Regs.set(Reg);
	}
	}
	}

	bool SIFixWWMLiveness::runOnWWMInstruction(MachineInstr &WWM) {
	MachineBasicBlock *MBB = WWM.getParent();

	// Compute the registers that are live out of MI by figuring out which defs
	// are reachable from MI.
	SparseBitVector<> LiveOut;

	for (auto II = MachineBasicBlock::iterator(WWM), IE =
	MBB->end(); II != IE; ++II) {
	addDefs(*II, LiveOut);
	}

	for (df_iterator<MachineBasicBlock *> I = ++df_begin(MBB),
	E = df_end(MBB);
	I != E; ++I) {
	for (const MachineInstr &MI : **I) {
	addDefs(MI, LiveOut);
	}
	}

	// Compute the registers that reach MI.
	SparseBitVector<> Reachable;

	for (auto II = ++MachineBasicBlock::reverse_iterator(WWM), IE =
	MBB->rend(); II != IE; ++II) {
	addDefs(*II, Reachable);
	}

	for (idf_iterator<MachineBasicBlock *> I = ++idf_begin(MBB),
	E = idf_end(MBB);
	I != E; ++I) {
	for (const MachineInstr &MI : **I) {
	addDefs(MI, Reachable);
	}
	}

	// find the intersection, and add implicit uses.
	LiveOut &= Reachable;

	bool Modified = false;
	for (unsigned Reg : LiveOut) {
	WWM.addOperand(MachineOperand::CreateReg(Reg, false, /*isImp=*/true));
	if (LIS) {
	// FIXME: is there a better way to update the live interval?
	LIS->removeInterval(Reg);
	LIS->createAndComputeVirtRegInterval(Reg);
	}
	Modified = true;
	}

	return Modified;
	}

	bool SIFixWWMLiveness::runOnMachineFunction(MachineFunction &MF) {			bool SIFixWWMLiveness::runOnMachineFunction(MachineFunction &MF) {
				LLVM_DEBUG(dbgs() << "SIFixWWMLiveness: function " << MF.getName() << "\n");
	bool Modified = false;			bool Modified = false;

	// This doesn't actually need LiveIntervals, but we can preserve them.			// This doesn't actually need LiveIntervals, but we can preserve them.
	LIS = getAnalysisIfAvailable<LiveIntervals>();			LIS = getAnalysisIfAvailable<LiveIntervals>();

	const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();			const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
	const SIInstrInfo *TII = ST.getInstrInfo();

				TII = ST.getInstrInfo();
	TRI = &TII->getRegisterInfo();			TRI = &TII->getRegisterInfo();
	MRI = &MF.getRegInfo();			MRI = &MF.getRegInfo();

				DomTree = &getAnalysis<MachineDominatorTree>();
				LoopInfo = &getAnalysis<MachineLoopInfo>();

				// Scan the function to find the WWM sections and the candidate registers for
				// having liveness modified.
	for (MachineBasicBlock &MBB : MF) {			for (MachineBasicBlock &MBB : MF) {
	for (MachineInstr &MI : MBB) {			for (MachineInstr &MI : MBB) {
	if (MI.getOpcode() == AMDGPU::EXIT_WWM) {			if (MI.getOpcode() == AMDGPU::EXIT_WWM)
	Modified |= runOnWWMInstruction(MI);			WWMs.push_back(&MI);
				else {
				for (MachineOperand &DefOpnd : MI.defs()) {
				if (DefOpnd.isReg()) {
				unsigned Reg = DefOpnd.getReg();
				if (TRI->isVGPR(*MRI, Reg))
				processDef(DefOpnd);
				}
				}
	}			}
	}			}
	}			}
				if (!WWMs.empty()) {
				// Synthesize liveness over WWM sections as required.
				for (auto ThenDef : ThenDefs)
				Modified |= processThenDef(ThenDef);
				for (auto LoopExitDef : LoopExitDefs)
				Modified |= processLoopExitDef(LoopExitDef.first, LoopExitDef.second);
				for (auto LoopPhiDef : LoopPhiDefs)
				Modified |= processLoopPhiDef(LoopPhiDef.first, LoopPhiDef.second);
				}

				WWMs.clear();
				ThenDefs.clear();
				LoopExitDefs.clear();
				LoopPhiDefs.clear();

	return Modified;			return Modified;
	}			}

				// During the function scan, process an operand that defines a VGPR.
				// This categorizes the register and puts it in the appropriate list for later
				// use when processing a WWM section.
				void SIFixWWMLiveness::processDef(MachineOperand &DefOpnd) {
				unsigned Reg = DefOpnd.getReg();
				// Get all the defining instructions. For convenience, make Defs[0] the def
				// we are on now.
				SmallVector<const MachineInstr *, 4> Defs;
				Defs.push_back(DefOpnd.getParent());
				for (auto &MI : MRI->def_instructions(Reg)) {
				if (&MI != DefOpnd.getParent())
				Defs.push_back(&MI);
				}
				// Check whether this def dominates all the others. If not, ignore this def.
				// Either it is going to be processed when the scan encounters its other def
				// that dominates all defs, or there is no def that dominates all others.
				// The latter case is an eliminated phi from an if..else..endif or similar,
				// which must be for uniform control flow so can be ignored.
				// Because this pass runs shortly after PHIElimination, we assume that any
				// multi-def register is a lowered phi, and thus has each def in a separate
				// basic block.
				for (unsigned I = 1; I != Defs.size(); ++I) {
				if (!DomTree->dominates(Defs[0]->getParent(), Defs[I]->getParent()))
				return;
				}
				// Check for the case of an if..endif lowered phi: It has two defs, one
				// dominates the other, and there is a single use in a successor of the
				// dominant def.
				// Later we will spot any WWM code inside
				// the "then" clause and turn the second def into a partial def so its
				// liveness goes through the WWM code in the "then" clause.
				if (Defs.size() == 2) {
				auto DomDefBlock = Defs[0]->getParent();
				if (DomDefBlock->succ_size() == 2 && MRI->hasOneUse(Reg)) {
				auto UseBlock = MRI->use_begin(Reg)->getParent()->getParent();
				for (auto Succ : DomDefBlock->successors()) {
				if (Succ == UseBlock) {
				LLVM_DEBUG(dbgs() << printReg(Reg, TRI) << " is a then phi reg\n");
				ThenDefs.push_back(&DefOpnd);
				return;
				}
				}
				}
				}
				// Check for the case of a non-lowered-phi register (single def) that exits
				// a loop, that is, it has a use that is outside a loop that the def is
				// inside. We find the outermost loop that the def is inside but a use is
				// outside. Later we will spot any WWM code inside that loop and then make
				// the def a partial def so its liveness goes round the loop and through the
				// WWM code.
				if (Defs.size() == 1) {
				auto Loop = LoopInfo->getLoopFor(Defs[0]->getParent());
				if (!Loop)
				return;
				bool IsLoopExit = false;
				for (auto &Use : MRI->use_instructions(Reg)) {
				auto UseBlock = Use.getParent();
				if (Loop->contains(UseBlock))
				continue;
				IsLoopExit = true;
				while (auto Parent = Loop->getParentLoop()) {
				if (Parent->contains(UseBlock))
				break;
				Loop = Parent;
				}
				}
				if (!IsLoopExit)
				return;
				LLVM_DEBUG(dbgs() << printReg(Reg, TRI)
				<< " is a loop exit reg with loop header at "
				<< "bb." << Loop->getHeader()->getNumber() << "\n");
				LoopExitDefs.push_back(std::pair<MachineOperand *, MachineLoop *>(
				&DefOpnd, Loop));
				return;
				}
				// Check for the case of a lowered single-preheader-loop phi, that is, a
				// multi-def register where the dominating def is in the loop pre-header and
				// all other defs are in backedges. Later we will spot any WWM code inside
				// that loop and then make the backedge defs partial defs so the liveness
				// goes through the WWM code.
				// Note that we are ignoring multi-preheader loops on the basis that the
				// structurizer does not allow that for non-uniform loops.
				// There must be a single use in the loop header.
				if (!MRI->hasOneUse(Reg))
				return;
				auto UseBlock = MRI->use_begin(Reg)->getParent()->getParent();
				auto Loop = LoopInfo->getLoopFor(UseBlock);
				if (!Loop || Loop->getHeader() != UseBlock
				|| Loop->contains(Defs[0]->getParent())) {
				LLVM_DEBUG(dbgs() << printReg(Reg, TRI)
				<< " is multi-def but single use not in loop header\n");
				return;
				}
				for (unsigned I = 1; I != Defs.size(); ++I) {
				if (!Loop->contains(Defs[I]->getParent()))
				return;
				}
				LLVM_DEBUG(dbgs() << printReg(Reg, TRI)
				<< " is a loop phi reg with loop header at "
				<< "bb." << Loop->getHeader()->getNumber() << "\n");
				LoopPhiDefs.push_back(
				std::pair<MachineOperand *, MachineLoop *>(&DefOpnd, Loop));
				}

				// Process a then phi def: It has two defs, one dominates the other, and there
				// is a single use in a successor of the dominant def. Here we spot any WWM
				// code inside the "then" clause and turn the second def into a partial def so
				// its liveness goes through the WWM code in the "then" clause.
				bool SIFixWWMLiveness::processThenDef(MachineOperand *DefOpnd) {
				LLVM_DEBUG(dbgs() << "Processing then def: " << *DefOpnd->getParent());
				if (DefOpnd->getParent()->getOpcode() == TargetOpcode::IMPLICIT_DEF) {
				// Ignore if dominating def is undef.
				LLVM_DEBUG(dbgs() << " ignoring as dominating def is undef\n");
				return false;
				}
				unsigned Reg = DefOpnd->getReg();
				// Get the use block, which is the endif block.
				auto UseBlock = MRI->use_instr_begin(Reg)->getParent();
				// Check whether there is WWM code inside the then branch. The WWM code must
				// be dominated by the if but not dominated by the endif.
				bool ContainsWWM = false;
				for (auto WWM : WWMs) {
				if (DomTree->dominates(DefOpnd->getParent()->getParent(), WWM->getParent())
				&& !DomTree->dominates(UseBlock, WWM->getParent())) {
				LLVM_DEBUG(dbgs() << " contains WWM: " << *WWM);
				ContainsWWM = true;
				break;
				}
				}
				if (!ContainsWWM)
				return false;
				// Get the other def.
				MachineInstr *OtherDef = nullptr;
				for (auto &MI : MRI->def_instructions(Reg)) {
				if (&MI != DefOpnd->getParent())
				OtherDef = &MI;
				}
				// Make it a partial def.
				OtherDef->addOperand(MachineOperand::CreateReg(Reg, false, /*isImp=*/true));
				LLVM_DEBUG(dbgs() << *OtherDef);
				return true;
				}

				// Process a loop exit def, that is, a register with a single def in a loop
				// that has a use outside the loop. Here we spot any WWM code inside that loop
				// and then make the def a partial def so its liveness goes round the loop and
				// through the WWM code.
				bool SIFixWWMLiveness::processLoopExitDef(MachineOperand *DefOpnd,
				MachineLoop *Loop) {
				LLVM_DEBUG(dbgs() << "Processing loop exit def: " << *DefOpnd->getParent());
				// Check whether there is WWM code inside the loop.
				bool ContainsWWM = false;
				for (auto WWM : WWMs) {
				if (Loop->contains(WWM->getParent())) {
				LLVM_DEBUG(dbgs() << " contains WWM: " << *WWM);
				ContainsWWM = true;
				break;
				}
				}
				if (!ContainsWWM)
				return false;
				unsigned Reg = DefOpnd->getReg();
				// Add a new implicit_def in loop preheader(s).
				for (auto Pred : Loop->getHeader()->predecessors()) {
				if (!Loop->contains(Pred)) {
				auto ImplicitDef = BuildMI(*Pred, Pred->getFirstTerminator(), DebugLoc(),
				TII->get(TargetOpcode::IMPLICIT_DEF), Reg);
				LLVM_DEBUG(dbgs() << *ImplicitDef);
				(void)ImplicitDef;
				}
				}
				// Make the original def partial.
				DefOpnd->getParent()->addOperand(MachineOperand::CreateReg(
				Reg, false, /*isImp=*/true));
				LLVM_DEBUG(dbgs() << *DefOpnd->getParent());
				return true;
				}

				// Process a loop phi def, that is, a multi-def register where the dominating
				// def is in the loop pre-header and all other defs are in backedges. Here we
				// spot any WWM code inside that loop and then make the backedge defs partial
				// defs so the liveness goes through the WWM code.
				bool SIFixWWMLiveness::processLoopPhiDef(MachineOperand *DefOpnd,
				MachineLoop *Loop) {
				LLVM_DEBUG(dbgs() << "Processing loop phi def: " << *DefOpnd->getParent());
				// Check whether there is WWM code inside the loop.
				bool ContainsWWM = false;
				for (auto WWM : WWMs) {
				if (Loop->contains(WWM->getParent())) {
				LLVM_DEBUG(dbgs() << " contains WWM: " << *WWM);
				ContainsWWM = true;
				break;
				}
				}
				if (!ContainsWWM)
				return false;
				unsigned Reg = DefOpnd->getReg();
				// Remove kill mark from uses.
				for (auto &Use : MRI->use_operands(Reg))
				Use.setIsKill(false);
				// Make all defs except the dominating one partial defs.
				SmallVector<MachineInstr *, 4> Defs;
				for (auto &Def : MRI->def_instructions(Reg))
				Defs.push_back(&Def);
				for (auto Def : Defs) {
				if (DefOpnd->getParent() == Def)
				continue;
				Def->addOperand(MachineOperand::CreateReg(Reg, false, /*isImp=*/true));
				LLVM_DEBUG(dbgs() << *Def);
				}
				return true;
				}
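
The "make all defs except the dominating one partial defs" step can be illustrated with a small standalone sketch. It assumes a toy instruction model: `ToyInstr` and `makeNonDominatingDefsPartial` are hypothetical names that are not part of the patch, and the dominating def is simply passed in by index instead of being derived from dominance information.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction (not LLVM's MachineInstr): it defines one
// register and carries a list of registers that it reads implicitly.
struct ToyInstr {
  unsigned DefReg = 0;
  std::vector<unsigned> ImplicitUses;
};

// Adds Reg as an implicit use to every def of Reg except the dominating one,
// turning those defs into "partial" defs. Returns the number of instructions
// that were changed.
std::size_t makeNonDominatingDefsPartial(std::vector<ToyInstr> &Instrs,
                                         std::size_t DominatingIdx,
                                         unsigned Reg) {
  assert(DominatingIdx < Instrs.size() && "dominating index out of range");
  assert(Instrs[DominatingIdx].DefReg == Reg && "dominating instr must def Reg");
  std::size_t Changed = 0;
  for (std::size_t I = 0; I < Instrs.size(); ++I) {
    if (I == DominatingIdx || Instrs[I].DefReg != Reg)
      continue; // skip the dominating def and defs of other registers
    Instrs[I].ImplicitUses.push_back(Reg);
    ++Changed;
  }
  return Changed;
}
```

This mirrors what the loop test in this revision checks for %27: the def in bb.0 is left alone, while the backedge def in bb.1 gains `implicit %27`.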

llvm/trunk/test/CodeGen/AMDGPU/fix-wwm-liveness.mir

# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-fix-wwm-liveness -o - %s | FileCheck %s		# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-fix-wwm-liveness -o - %s | FileCheck %s
#CHECK: $exec = EXIT_WWM killed %19, implicit %21
		# Test a then phi value.
		#CHECK: test_wwm_liveness_then_phi
		#CHECK: %21:vgpr_32 = V_MOV_B32_e32 1, implicit $exec, implicit %21

---		---
name: test_wwm_liveness		name: test_wwm_liveness_then_phi
alignment: 0		alignment: 0
exposesReturnsTwice: false		exposesReturnsTwice: false
legalized: false		legalized: false
regBankSelected: false		regBankSelected: false
selected: false		selected: false
tracksRegLiveness: true		tracksRegLiveness: true
registers:		registers:
- { id: 0, class: sreg_64, preferred-register: '' }		- { id: 0, class: sreg_64, preferred-register: '' }
bb.1:
BUFFER_STORE_DWORD_OFFSET killed %18, killed %15, 0, 0, 0, 0, 0, implicit $exec :: (store 4)		BUFFER_STORE_DWORD_OFFSET killed %18, killed %15, 0, 0, 0, 0, 0, implicit $exec :: (store 4)

bb.2:		bb.2:
$exec = S_OR_B64 $exec, killed %0, implicit-def $scc		$exec = S_OR_B64 $exec, killed %0, implicit-def $scc
$vgpr0 = COPY killed %21		$vgpr0 = COPY killed %21
SI_RETURN_TO_EPILOG killed $vgpr0		SI_RETURN_TO_EPILOG killed $vgpr0

...		...

		# Test a loop with a loop exit value and a loop phi.
		#CHECK: test_wwm_liveness_loop
		#CHECK: %4:vgpr_32 = IMPLICIT_DEF
		#CHECK: bb.1:
		#CHECK: %4:vgpr_32 = FLAT_LOAD_DWORD{{.*}}, implicit %4
		#CHECK: %27:vgpr_32 = COPY killed %21, implicit %27

		---
		name: test_wwm_liveness_loop
		alignment: 0
		exposesReturnsTwice: false
		legalized: false
		regBankSelected: false
		selected: false
		failedISel: false
		tracksRegLiveness: true
		registers:
		- { id: 0, class: vgpr_32, preferred-register: '' }
		- { id: 1, class: sreg_32_xm0, preferred-register: '' }
		- { id: 2, class: sreg_64, preferred-register: '' }
		- { id: 3, class: sreg_32_xm0, preferred-register: '' }
		- { id: 4, class: vgpr_32, preferred-register: '' }
		- { id: 5, class: sreg_32_xm0, preferred-register: '' }
		- { id: 6, class: sreg_64, preferred-register: '' }
		- { id: 7, class: sreg_64, preferred-register: '' }
		- { id: 8, class: sreg_64, preferred-register: '' }
		- { id: 9, class: vreg_64, preferred-register: '' }
		- { id: 10, class: vgpr_32, preferred-register: '' }
		- { id: 11, class: vgpr_32, preferred-register: '' }
		- { id: 12, class: vgpr_32, preferred-register: '' }
		- { id: 13, class: sreg_64, preferred-register: '' }
		- { id: 14, class: vreg_64, preferred-register: '' }
		- { id: 15, class: sreg_32_xm0, preferred-register: '' }
		- { id: 16, class: vgpr_32, preferred-register: '' }
		- { id: 17, class: sreg_64, preferred-register: '$vcc' }
		- { id: 18, class: vgpr_32, preferred-register: '' }
		- { id: 19, class: vgpr_32, preferred-register: '' }
		- { id: 20, class: vgpr_32, preferred-register: '' }
		- { id: 21, class: vgpr_32, preferred-register: '' }
		- { id: 22, class: vgpr_32, preferred-register: '' }
		- { id: 23, class: sreg_64, preferred-register: '' }
		- { id: 24, class: sreg_64, preferred-register: '' }
		- { id: 25, class: sreg_64, preferred-register: '' }
		- { id: 26, class: sreg_64, preferred-register: '' }
		- { id: 27, class: vgpr_32, preferred-register: '' }
		liveins:
		frameInfo:
		isFrameAddressTaken: false
		isReturnAddressTaken: false
		hasStackMap: false
		hasPatchPoint: false
		stackSize: 0
		offsetAdjustment: 0
		maxAlignment: 0
		adjustsStack: false
		hasCalls: false
		stackProtector: ''
		maxCallFrameSize: 4294967295
		hasOpaqueSPAdjustment: false
		hasVAStart: false
		hasMustTailInVarArgFunc: false
		localFrameSize: 0
		savePoint: ''
		restorePoint: ''
		fixedStack:
		stack:
		constants:
		body: |
		bb.0:
		successors: %bb.1(0x80000000)

		%25:sreg_64 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
		%0:vgpr_32 = FLAT_LOAD_DWORD undef %9:vreg_64, 0, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load 4 from `float addrspace(1)* undef`, addrspace 1)
		$exec = EXIT_WWM killed %25
		%12:vgpr_32 = V_MBCNT_LO_U32_B32_e64 -1, 0, implicit $exec
		%7:sreg_64 = S_MOV_B64 0
		%26:sreg_64 = COPY killed %7
		%27:vgpr_32 = COPY killed %12

		bb.1:
		successors: %bb.2(0x04000000), %bb.1(0x7c000000)

		%24:sreg_64 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
		%20:vgpr_32 = COPY killed %27
		%2:sreg_64 = COPY killed %26
		%4:vgpr_32 = FLAT_LOAD_DWORD undef %14:vreg_64, 0, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load 4 from `float addrspace(1)* undef`, addrspace 1)
		$exec = EXIT_WWM killed %24
		%22:vgpr_32 = V_ADD_I32_e32 -1, killed %20, implicit-def dead $vcc, implicit $exec
		%17:sreg_64 = V_CMP_EQ_U32_e64 0, %22, implicit $exec
		%6:sreg_64 = S_OR_B64 killed %17, killed %2, implicit-def $scc
		%21:vgpr_32 = COPY killed %22
		%26:sreg_64 = COPY %6
		%27:vgpr_32 = COPY killed %21
		$exec = S_ANDN2_B64_term $exec, %6
		S_CBRANCH_EXECNZ %bb.1, implicit $exec
		S_BRANCH %bb.2

		bb.2:
		$exec = S_OR_B64 $exec, killed %6, implicit-def $scc
		%23:sreg_64 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
		%18:vgpr_32 = V_ADD_F32_e32 killed %0, killed %4, implicit $exec
		$exec = EXIT_WWM killed %23
		early-clobber %19:vgpr_32 = COPY killed %18, implicit $exec
		$vgpr0 = COPY killed %19
		SI_RETURN_TO_EPILOG killed $vgpr0

		...

llvm/trunk/test/CodeGen/AMDGPU/wqm.ll

main_body:
%temp = fadd float %src1, %src1		%temp = fadd float %src1, %src1
%temp.0 = call float @llvm.amdgcn.wwm.f32(float %temp)		%temp.0 = call float @llvm.amdgcn.wwm.f32(float %temp)
%out = fadd float %temp.0, %temp.0		%out = fadd float %temp.0, %temp.0
%out.0 = call float @llvm.amdgcn.wqm.f32(float %out)		%out.0 = call float @llvm.amdgcn.wqm.f32(float %out)
ret float %out.0		ret float %out.0
}		}

; Check that WWM is turned on correctly across basic block boundaries.		; Check that WWM is turned on correctly across basic block boundaries.
		; if..then..endif version
;		;
;CHECK-LABEL: {{^}}test_wwm6:		;CHECK-LABEL: {{^}}test_wwm6_then:
;CHECK: s_or_saveexec_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], -1		;CHECK: s_or_saveexec_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], -1
;SI-CHECK: buffer_load_dword		;SI-CHECK: buffer_load_dword
;VI-CHECK: flat_load_dword		;VI-CHECK: flat_load_dword
;CHECK: s_mov_b64 exec, [[ORIG]]		;CHECK: s_mov_b64 exec, [[ORIG]]
;CHECK: %if		;CHECK: %if
;CHECK: s_or_saveexec_b64 [[ORIG2:s\[[0-9]+:[0-9]+\]]], -1		;CHECK: s_or_saveexec_b64 [[ORIG2:s\[[0-9]+:[0-9]+\]]], -1
;SI-CHECK: buffer_load_dword		;SI-CHECK: buffer_load_dword
;VI-CHECK: flat_load_dword		;VI-CHECK: flat_load_dword
;CHECK: v_add_f32_e32		;CHECK: v_add_f32_e32
;CHECK: s_mov_b64 exec, [[ORIG2]]		;CHECK: s_mov_b64 exec, [[ORIG2]]
define amdgpu_ps float @test_wwm6() {		define amdgpu_ps float @test_wwm6_then() {
main_body:		main_body:
%src0 = load volatile float, float addrspace(1)* undef		%src0 = load volatile float, float addrspace(1)* undef
; use mbcnt to make sure the branch is divergent		; use mbcnt to make sure the branch is divergent
%lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)		%lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%hi = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)		%hi = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
%cc = icmp uge i32 %hi, 32		%cc = icmp uge i32 %hi, 32
br i1 %cc, label %endif, label %if		br i1 %cc, label %endif, label %if

if:		if:
%src1 = load volatile float, float addrspace(1)* undef		%src1 = load volatile float, float addrspace(1)* undef
%out = fadd float %src0, %src1		%out = fadd float %src0, %src1
%out.0 = call float @llvm.amdgcn.wwm.f32(float %out)		%out.0 = call float @llvm.amdgcn.wwm.f32(float %out)
br label %endif		br label %endif

endif:		endif:
%out.1 = phi float [ %out.0, %if ], [ 0.0, %main_body ]		%out.1 = phi float [ %out.0, %if ], [ 0.0, %main_body ]
ret float %out.1		ret float %out.1
}		}

		; Check that WWM is turned on correctly across basic block boundaries.
		; loop version
		;
		;CHECK-LABEL: {{^}}test_wwm6_loop:
		;CHECK: s_or_saveexec_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], -1
		;SI-CHECK: buffer_load_dword
		;VI-CHECK: flat_load_dword
		;CHECK: s_mov_b64 exec, [[ORIG]]
		;CHECK: %loop
		;CHECK: s_or_saveexec_b64 [[ORIG2:s\[[0-9]+:[0-9]+\]]], -1
		;SI-CHECK: buffer_load_dword
		;VI-CHECK: flat_load_dword
		;CHECK: s_mov_b64 exec, [[ORIG2]]
		define amdgpu_ps float @test_wwm6_loop() {
		main_body:
		%src0 = load volatile float, float addrspace(1)* undef
		; use mbcnt to make sure the branch is divergent
		%lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
		%hi = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
		br label %loop

		loop:
		%counter = phi i32 [ %lo, %main_body ], [ %counter.1, %loop ]
		%src1 = load volatile float, float addrspace(1)* undef
		%out = fadd float %src0, %src1
		%out.0 = call float @llvm.amdgcn.wwm.f32(float %out)
		%counter.1 = sub i32 %counter, 1
		%cc = icmp ne i32 %counter.1, 0
		br i1 %cc, label %loop, label %endloop

		endloop:
		ret float %out.0
		}

; Check that @llvm.amdgcn.set.inactive disables WWM.		; Check that @llvm.amdgcn.set.inactive disables WWM.
;		;
;CHECK-LABEL: {{^}}test_set_inactive1:		;CHECK-LABEL: {{^}}test_set_inactive1:
;CHECK: buffer_load_dword		;CHECK: buffer_load_dword
;CHECK: s_not_b64 exec, exec		;CHECK: s_not_b64 exec, exec
;CHECK: v_mov_b32_e32		;CHECK: v_mov_b32_e32
;CHECK: s_not_b64 exec, exec		;CHECK: s_not_b64 exec, exec
;CHECK: s_or_saveexec_b64 s{{\[[0-9]+:[0-9]+\]}}, -1		;CHECK: s_or_saveexec_b64 s{{\[[0-9]+:[0-9]+\]}}, -1