This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Merge M0 initializations
ClosedPublic

Authored by rampitec on Apr 20 2017, 2:20 AM.

Details

Reviewers

vpykhtin
arsenm

Commits

rGbd5394be3d2b: [AMDGPU] Merge M0 initializations
rL301228: [AMDGPU] Merge M0 initializations

Summary

Merges equivalent initializations of M0 and hoists them into a common
dominator block. Technically the same code can be used with any
register, physical or virtual.

It is off by default because it creates performance regressions instead
of improvements. That is caused by an additional freedom scheduler gets
when M0 gets out of its way, and it is notorious for blowing up register
pressure. This is however needed to create a new scheduler and even to
experiment with it, so it is put under an option until new scheduler is
ready.
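
To make "common dominator block" concrete, here is a minimal sketch of how the nearest common dominator of two blocks can be found. It assumes a toy dominator tree stored as an immediate-dominator array; ToyDomTree and makeDiamond are hypothetical names for illustration and are not part of the patch, which uses MachineDominatorTree::findNearestCommonDominator() instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy dominator tree: idom[b] is the immediate dominator of block b, and the
// entry block (index 0) is its own immediate dominator.
// Hypothetical stand-in for MachineDominatorTree, for illustration only.
struct ToyDomTree {
  std::vector<std::size_t> idom;

  // Depth of a block in the dominator tree (the entry block has depth 0).
  std::size_t depth(std::size_t b) const {
    assert(b < idom.size());
    std::size_t d = 0;
    while (idom[b] != b) {
      b = idom[b];
      ++d;
    }
    return d;
  }

  // Bring both blocks to the same depth, then walk up until they meet.
  std::size_t nearestCommonDominator(std::size_t a, std::size_t b) const {
    std::size_t da = depth(a), db = depth(b);
    while (da > db) { a = idom[a]; --da; }
    while (db > da) { b = idom[b]; --db; }
    while (a != b) { a = idom[a]; b = idom[b]; }
    return a;
  }
};

// Diamond CFG 0 -> {1, 2} -> 3: every block is immediately dominated by 0.
inline ToyDomTree makeDiamond() { return ToyDomTree{{0, 0, 0, 0}}; }
```

In the diamond, two identical initializations sitting in blocks 1 and 2 have block 0 as their nearest common dominator, so block 0 is where a single merged initialization could be placed, provided nothing clobbers the register in between.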

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision.Apr 20 2017, 2:20 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Apr 20 2017, 2:20 AM

Fixed formatting.

Replaced SmallVector::insert() with SmallVector::append().

Thank you for doing this! I really need it.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
473 ↗	(On Diff #95913)	I'm confused. Why may a path from Clobber to From clobber From? 'From' is a def, and having a path from Clobber to From clobbers Clobber? :-)
478 ↗	(On Diff #95913)	this is a xor
503 ↗	(On Diff #95913)	Maybe it would be better to exhaust Defs by erase only, without having push_back/pop_back?
537 ↗	(On Diff #95913)	Looks like LocalChanged and Changed have the same value; replace them with one flag?

rampitec marked 4 inline comments as done.Apr 20 2017, 9:43 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
473 ↗	(On Diff #95913)	Here From is the position from which we want to move a def, and To is the position to move it to. Clobber is a potentially clobbering instruction. If the clobber is reachable at neither From nor To, we are safe to move with respect to that clobber. If one is reachable and the other is not, we are not safe, because the clobber will hide an initialization at either the old or the new position, resulting in a different value reaching consumers of the def. The remaining question is whether we can move if both are reachable, which is checked later in this lambda. For example, if the clobber is in the entry block and both the From and To positions are well after it, both are reachable, but there is no clobbering in between.
478 ↗	(On Diff #95913)	We do not have a ^^ operator, so it is either casts or potential warnings.
503 ↗	(On Diff #95913)	Here we have removed MI2, but MI1 can still be combined with something. We need to repeat the inner loop from the very beginning, thus the push_back. We cannot really exhaust Defs and leave it empty: all defs that remain must be re-added, so that in the next iteration of the outer loop, which processes a different initialization value, they become potential clobbers themselves.
537 ↗	(On Diff #95913)	LocalChanged basically tells us that something was combined in the inner loop and we can just continue. If nothing was combined, the instruction has to be moved from Defs into Visited. When we exhaust the current Defs, Visited will contain only the remaining values. These need to be moved back into Defs so that the remaining defs can serve as potential clobbers for other iterations. Changed then accumulates the return value over all iterations.
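
The three cases described for line 473, together with the xor question on line 478, can be sketched as a small standalone function. This is only an illustration under assumed names (ClobberCheck, classifyClobber); the patch itself keeps this logic inline in a lambda and spells out both halves of the xor.

```cpp
#include <cassert>

// Outcome of checking one potential clobber against moving a def from
// position From to position To. Names are illustrative, not from the patch.
enum class ClobberCheck { Safe, Interferes, NeedsDominanceCheck };

inline ClobberCheck classifyClobber(bool mayClobberFrom, bool mayClobberTo) {
  // The clobber reaches neither position: it cannot affect the move.
  if (!mayClobberFrom && !mayClobberTo)
    return ClobberCheck::Safe;
  // For bool operands, != behaves as a logical xor: exactly one position
  // is reachable, so the moved def would see a different value.
  if (mayClobberFrom != mayClobberTo)
    return ClobberCheck::Interferes;
  // Both are reachable: the caller still has to prove that the clobber
  // does not sit between the two positions.
  return ClobberCheck::NeedsDominanceCheck;
}

// Usage sketch: the entry-block example, where the clobber reaches both
// positions and only a dominance check can settle the question.
inline void classifyClobberExample() {
  assert(classifyClobber(true, true) == ClobberCheck::NeedsDominanceCheck);
}
```

Comparing two bool values with != avoids both the casts and the warnings mentioned above, since no integer promotion of a bitwise ^ is involved.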

vpykhtin added inline comments.Apr 20 2017, 10:38 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
428 ↗	(On Diff #95913)	If I understood correctly, this loop ensures that the instruction defines only Reg and has one Imm? Maybe it would be clearer and more reliable if we just checked for the instructions we're interested in, like moves?
473 ↗	(On Diff #95913)	ok, this helps. I misunderstood from and to.
496 ↗	(On Diff #95913)	Why not to combine these checks with the part above like: if (MDT.dominates(MI1, MI2)) { if (!intereferes(MI2, MI1)) { ... } } else if (MDT.dominates(MI2, MI1)) { if (!intereferes(MI1, MI2)) { ... } } else { auto *MBB = MDT.findNearestCommonDominator(MI1->getParent(), MI2->getParent()); if (!MBB) continue; I = MBB->getFirstNonPHI(); if (!intereferes(MI1, I) && !intereferes(MI2, I)) { ... } } and possibly factor out common code in these parts
503 ↗	(On Diff #95913)	I think you can do it like this: define an iterator OI for the outer loop iterating Defs from begin to end (an old-fashioned for loop). Each time the inner loop deletes something, reset OI to the beginning of Defs. You would get rid of push/pop and the Visited array.
534 ↗	(On Diff #95913)	Why do you need to restore Defs? It looks like it isn't used anymore.

rampitec marked 7 inline comments as done.Apr 20 2017, 10:47 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
428 ↗	(On Diff #95913)	M0 is not defined by a mov, it is defined by SI_INIT_M0. Also we need to capture all defs to check for clobbering.
496 ↗	(On Diff #95913)	I can be neither MI1 nor MI2.
503 ↗	(On Diff #95913)	I do not want to reset it and reprocess what I already know I cannot combine. Then the inner loop would need to run only on a slice of Defs.
534 ↗	(On Diff #95913)	That is to use remaining defs as potential clobbers for other iterations.

vpykhtin added inline comments.Apr 20 2017, 11:01 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
496 ↗	(On Diff #95913)	the logic is the same.
503 ↗	(On Diff #95913)	Ok, then you don't need to reset to begin each time; you just need to handle deletion of the element pointed to by OI by moving OI to the next position. Use std::list for Defs.
534 ↗	(On Diff #95913)	Ok, then instead of restoring Defs it's better to copy it before processing into a container better suited for element removal, such as std::list, and use an iterator for removal instead of using find.
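
The std::list suggestion can be illustrated with a standalone sketch of a pairwise merge loop that removes elements through iterators only. mergePairs and its canMerge predicate are hypothetical; the predicate stands in for the dominance and interference checks of the real pass, and this simplified version only ever erases the later element of a pair.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Removes every later element that can be merged into an earlier one.
// canMerge is a hypothetical predicate, not part of the patch.
template <class T, class Pred>
bool mergePairs(std::list<T> &defs, Pred canMerge) {
  bool changed = false;
  for (auto i1 = defs.begin(); i1 != defs.end(); ++i1) {
    for (auto i2 = std::next(i1); i2 != defs.end();) {
      if (canMerge(*i1, *i2)) {
        // erase() returns the iterator following the removed element; all
        // other std::list iterators, including i1, remain valid.
        i2 = defs.erase(i2);
        changed = true;
      } else {
        ++i2;
      }
    }
  }
  return changed;
}

// Usage sketch: duplicates of an earlier value are dropped in place.
inline void mergePairsExample() {
  std::list<int> defs{1, 2, 1, 3};
  bool changed = mergePairs(defs, [](int a, int b) { return a == b; });
  assert(changed && defs.size() == 3);
}
```

Because erasing from a std::list invalidates only the erased element, neither a Visited container nor a push_back/pop_back cycle is needed, and the elements that remain are still available as potential clobbers for later iterations.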

I think this is more complicated than it needs to be and is reinventing most of the logic for a generic hoisting pass. We already know the value of m0 at the important uses. OpenGL might also want to initialize it once in the prolog from a register, and the other uses of m0 are less frequent.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
398–400 ↗	(On Diff #95913)	This shouldn't be needed. m0 would be the only possible live in physreg at this point, and you reserved it
lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗	(On Diff #95913)	My patch specifically avoided doing this. I don't think we want it to be reserved, because this kills all generic copy optimizations.

In D32279#732500, @arsenm wrote:

I think this is more complicated than it needs to be and is reinventing most of the logic for a generic hoisting pass. We already know the value of m0 at the important uses. OpenGL might also want to initialize it once in the prolog from a register, and the other uses of m0 are less frequent.

M0 uses will become more frequent when we implement movrel scratch promotion. Then even without, it is either a single init per function, which is less than needed, or something like this.

rampitec added inline comments.Apr 20 2017, 1:16 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
398–400 ↗	(On Diff #95913)	This might not be needed since it is now reserved, but then if I want to restore SpillToSMEM I would need live-ins. I am thinking of replacing RS->isRegUsed() with:

static bool isRegUsed(unsigned Reg, MachineBasicBlock::iterator MI,
                      const SIRegisterInfo *TRI, RegScavenger *RS) {
  const MachineBasicBlock *MBB = MI->getParent();
  const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();

  if (!MRI.isReserved(Reg))
    return RS->isRegUsed(Reg);

  bool Defined = MBB->isLiveIn(Reg);
  bool Found = false;

  for (auto &I : *MBB) {
    if (&I == &*MI) {
      if (!Defined)
        return false;
      Found = true;
    }
    if (Found && I.readsRegister(Reg, TRI))
      return true;
    if (I.modifiesRegister(Reg, TRI)) {
      if (Found)
        return false;
      Defined = true;
    } else if (I.killsRegister(Reg, TRI)) {
      if (Found)
        return false;
      Defined = false;
    }
  }

  return Defined;
}

lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗	(On Diff #95913)	Here is the problem: when I hoist an init out of the use block, I need to add a live-in to satisfy live variable analysis and then the verifier after it. Now if I have live-ins, I hit this: "MBB has allocatable live-in, but isn't entry or landing-pad." So I have to make it reserved then.

Do you have better ideas?

D30227 has been waiting for review for a long time

In D32279#732555, @arsenm wrote:

D30227 has been waiting for review for a long time

You have no reviewers, so I guess that is why ;)
Seriously, this looks good, although there can be less than efficient side effects if we use M0 for something else besides LDS.
Then another problem, even in this change I have to disable it by default due to problems with register pressure after scheduler. It can only be enabled when scheduler can deal with pressure better, but at the same time we need it to write that scheduler. I do not see an easy way to disable D30227.

Combined checks in the inner loop.

In D32279#732558, @rampitec wrote:

In D32279#732555, @arsenm wrote:

D30227 has been waiting for review for a long time

You have no reviewers, so I guess that is why ;)
Seriously, this looks good, although there can be less than efficient side effects if we use M0 for something else besides LDS.
Then another problem, even in this change I have to disable it by default due to problems with register pressure after scheduler. It can only be enabled when scheduler can deal with pressure better, but at the same time we need it to write that scheduler. I do not see an easy way to disable D30227.

It shouldn't be necessary to disable. Another factor is being able to eliminate the initializations on gfx9 for lds

The other patch may not have the same scheduler problem impact

In D32279#732566, @arsenm wrote:

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

In D32279#732578, @rampitec wrote:

In D32279#732566, @arsenm wrote:

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

In D32279#732613, @arsenm wrote:

In D32279#732578, @rampitec wrote:

In D32279#732566, @arsenm wrote:

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

In D32279#732614, @rampitec wrote:

In D32279#732613, @arsenm wrote:

In D32279#732578, @rampitec wrote:

In D32279#732566, @arsenm wrote:

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

Switched to list from vector.
Removed declspec which breaks MSVC.

rampitec marked 5 inline comments as done.Apr 20 2017, 3:17 PM

In D32279#732624, @arsenm wrote:

In D32279#732614, @rampitec wrote:

In D32279#732613, @arsenm wrote:

In D32279#732578, @rampitec wrote:

In D32279#732566, @arsenm wrote:

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

No

The patch really seems old; there are too many conflicts when I try to apply it.

Removed live-in logic since M0 ends up reserved anyway.

Looks very good now! Thanks!

This revision is now accepted and ready to land.Apr 21 2017, 4:10 AM

Removed temp iterator copies before erase() since there is a sequence point before the actual call.
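
A minimal sketch of the idiom this update relies on, using a plain std::list<int> rather than the pass's list of MachineInstr pointers (eraseEven is a hypothetical helper for illustration): the post-increment of the argument is completed before the body of erase() runs, so the loop iterator has already moved past the element that is about to be destroyed.

```cpp
#include <cassert>
#include <list>

// Erases all even values. `it++` hands the old position to erase() and
// advances `it` before erase() is entered, so no temporary copy of the
// iterator is needed. This relies on std::list::erase invalidating only
// the erased element.
inline void eraseEven(std::list<int> &values) {
  for (auto it = values.begin(); it != values.end();) {
    if (*it % 2 == 0)
      values.erase(it++);
    else
      ++it;
  }
}

// Usage sketch.
inline void eraseEvenExample() {
  std::list<int> values{1, 2, 3, 4};
  eraseEven(values);
  assert(values.size() == 2);
}
```

The same pattern would not be safe for a std::vector, where erase() invalidates every iterator at or after the erased position.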

arsenm added inline comments.Apr 21 2017, 12:27 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
391 ↗	(On Diff #96191)	typo diffeernt
409 ↗	(On Diff #96191)	list is weird. vector or SmallVector?

Fixed typo in comment.

Reverted back to SmallVector for clobbers, it does not need to be list.

rampitec added inline comments.Apr 21 2017, 4:46 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗	(On Diff #95913)	Actually D30227 should have the same issues with phys live-ins (and have them according to its description), so it also shall fail verification unless M0 is reserved. Two patches are not different here, the problem is not in the approach, but in the very fact of physreg live-in to a block. I can remove reserveRegisterTuples, restore live-in generation and result will be the same as in D30227, it will work, but will not pass verification. Reserving M0 allows it to pass.

arsenm added inline comments.Apr 21 2017, 4:51 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗	(On Diff #95913)	I think the point of mostly handling in the DAG was the generic instremitter handled figuring out where live in phys regs was absolutely necessary. I also went through a few iterations where m0 wasn't live in, but there was a copy from the one initialized register in the prolog

rampitec added inline comments.Apr 21 2017, 4:54 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗	(On Diff #95913)	The code I have removed did minimal live-in attribution (i.e. only where absolutely necessary). That still does not change the fact that physreg live-ins are only expected in entry and ehpad blocks.

LGTM

Closed by commit rL301228: [AMDGPU] Merge M0 initializations (authored by rampitec). Apr 24 2017, 12:50 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/SIFixSGPRCopies.cpp	185 lines
llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.cpp	3 lines
llvm/trunk/test/CodeGen/AMDGPU/merge-m0.mir	132 lines
llvm/trunk/test/CodeGen/AMDGPU/spill-m0.ll	22 lines

Diff 96450

llvm/trunk/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

[... 75 lines not shown ...]
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "si-fix-sgpr-copies"		#define DEBUG_TYPE "si-fix-sgpr-copies"

		static cl::opt<bool> EnableM0Merge(
		"amdgpu-enable-merge-m0",
		cl::desc("Merge and hoist M0 initializations"),
		cl::init(false));

namespace {		namespace {

class SIFixSGPRCopies : public MachineFunctionPass {		class SIFixSGPRCopies : public MachineFunctionPass {

MachineDominatorTree *MDT;		MachineDominatorTree *MDT;

public:		public:
static char ID;		static char ID;
[... 11 lines not shown ...]	void getAnalysisUsage(AnalysisUsage &AU) const override {
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}
};		};

} // End anonymous namespace		} // End anonymous namespace

INITIALIZE_PASS_BEGIN(SIFixSGPRCopies, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(SIFixSGPRCopies, DEBUG_TYPE,
"SI Fix SGPR copies", false, false)		"SI Fix SGPR copies", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)		INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
INITIALIZE_PASS_END(SIFixSGPRCopies, DEBUG_TYPE,		INITIALIZE_PASS_END(SIFixSGPRCopies, DEBUG_TYPE,
"SI Fix SGPR copies", false, false)		"SI Fix SGPR copies", false, false)


char SIFixSGPRCopies::ID = 0;		char SIFixSGPRCopies::ID = 0;

char &llvm::SIFixSGPRCopiesID = SIFixSGPRCopies::ID;		char &llvm::SIFixSGPRCopiesID = SIFixSGPRCopies::ID;

[... 207 lines not shown ...]	static bool isSafeToFoldImmIntoCopy(const MachineInstr *Copy,
case AMDGPU::V_MOV_B64_PSEUDO:		case AMDGPU::V_MOV_B64_PSEUDO:
SMovOp = AMDGPU::S_MOV_B64;		SMovOp = AMDGPU::S_MOV_B64;
break;		break;
}		}
Imm = ImmOp->getImm();		Imm = ImmOp->getImm();
return true;		return true;
}		}

static bool predsHasDivergentTerminator(MachineBasicBlock *MBB,		template <class UnaryPredicate>
const TargetRegisterInfo *TRI) {		bool searchPredecessors(const MachineBasicBlock *MBB,
DenseSet<MachineBasicBlock*> Visited;		const MachineBasicBlock *CutOff,
		UnaryPredicate Predicate) {

		if (MBB == CutOff)
		return false;

		DenseSet<const MachineBasicBlock*> Visited;
SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(),		SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(),
MBB->pred_end());		MBB->pred_end());

while (!Worklist.empty()) {		while (!Worklist.empty()) {
MachineBasicBlock *mbb = Worklist.back();		MachineBasicBlock *MBB = Worklist.pop_back_val();
Worklist.pop_back();

if (!Visited.insert(mbb).second)		if (!Visited.insert(MBB).second)
continue;		continue;
if (hasTerminatorThatModifiesExec(mbb, TRI))		if (MBB == CutOff)
		continue;
		if (Predicate(MBB))
return true;		return true;

Worklist.insert(Worklist.end(), mbb->pred_begin(), mbb->pred_end());		Worklist.append(MBB->pred_begin(), MBB->pred_end());
}		}

return false;		return false;
}		}

		static bool predsHasDivergentTerminator(MachineBasicBlock *MBB,
		const TargetRegisterInfo *TRI) {
		return searchPredecessors(MBB, nullptr, [TRI](MachineBasicBlock *MBB) {
		return hasTerminatorThatModifiesExec(MBB, TRI); });
		}

		// Checks if there is potential path From instruction To instruction.
		// If CutOff is specified and it sits in between of that path we ignore
		// a higher portion of the path and report it is not reachable.
		static bool isReachable(const MachineInstr *From,
		const MachineInstr *To,
		const MachineBasicBlock *CutOff,
		MachineDominatorTree &MDT) {
		// If either From block dominates To block or instructions are in the same
		// block and From is higher.
		if (MDT.dominates(From, To))
		return true;

		const MachineBasicBlock *MBBFrom = From->getParent();
		const MachineBasicBlock *MBBTo = To->getParent();
		if (MBBFrom == MBBTo)
		return false;

		// Instructions are in different blocks, do predecessor search.
		// We should almost never get here since we do not usually produce M0 stores
		// other than -1.
		return searchPredecessors(MBBTo, CutOff, [MBBFrom]
		(const MachineBasicBlock *MBB) { return MBB == MBBFrom; });
		}

		// Hoist and merge identical SGPR initializations into a common predecessor.
		// This is intended to combine M0 initializations, but can work with any
		// SGPR. A VGPR cannot be processed since we cannot guarantee vector
		// execution.
		static bool hoistAndMergeSGPRInits(unsigned Reg,
		const MachineRegisterInfo &MRI,
		MachineDominatorTree &MDT) {
		// List of inits by immediate value.
		typedef std::map<unsigned, std::list<MachineInstr*>> InitListMap;
		InitListMap Inits;
		// List of clobbering instructions.
		SmallVector<MachineInstr*, 8> Clobbers;
		bool Changed = false;

		for (auto &MI : MRI.def_instructions(Reg)) {
		MachineOperand *Imm = nullptr;
		for (auto &MO: MI.operands()) {
		if ((MO.isReg() && ((MO.isDef() && MO.getReg() != Reg) || !MO.isDef())) ||
		(!MO.isImm() && !MO.isReg()) || (MO.isImm() && Imm)) {
		Imm = nullptr;
		break;
		} else if (MO.isImm())
		Imm = &MO;
		}
		if (Imm)
		Inits[Imm->getImm()].push_front(&MI);
		else
		Clobbers.push_back(&MI);
		}

		for (auto &Init : Inits) {
		auto &Defs = Init.second;

		for (auto I1 = Defs.begin(), E = Defs.end(); I1 != E; ) {
		MachineInstr *MI1 = *I1;

		for (auto I2 = std::next(I1); I2 != E; ) {
		MachineInstr *MI2 = *I2;

		// Check any possible interference
		auto intereferes = [&](MachineBasicBlock::iterator From,
		MachineBasicBlock::iterator To) -> bool {

		assert(MDT.dominates(&*To, &*From));

		auto interferes = [&MDT, From, To](MachineInstr* &Clobber) -> bool {
		const MachineBasicBlock *MBBFrom = From->getParent();
		const MachineBasicBlock *MBBTo = To->getParent();
		bool MayClobberFrom = isReachable(Clobber, &*From, MBBTo, MDT);
		bool MayClobberTo = isReachable(Clobber, &*To, MBBTo, MDT);
		if (!MayClobberFrom && !MayClobberTo)
		return false;
		if ((MayClobberFrom && !MayClobberTo) ||
		(!MayClobberFrom && MayClobberTo))
		return true;
		// Both can clobber, this is not an interference only if both are
		// dominated by Clobber and belong to the same block or if Clobber
		// properly dominates To, given that To >> From, so it dominates
		// both and located in a common dominator.
		return !((MBBFrom == MBBTo &&
		MDT.dominates(Clobber, &*From) &&
		MDT.dominates(Clobber, &*To)) ||
		MDT.properlyDominates(Clobber->getParent(), MBBTo));
		};

		return (any_of(Clobbers, interferes)) ||
		(any_of(Inits, [&](InitListMap::value_type &C) {
		return C.first != Init.first && any_of(C.second, interferes);
		}));
		};

		if (MDT.dominates(MI1, MI2)) {
		if (!intereferes(MI2, MI1)) {
		DEBUG(dbgs() << "Erasing from BB#" << MI2->getParent()->getNumber()
		<< " " << *MI2);
		MI2->eraseFromParent();
		Defs.erase(I2++);
		Changed = true;
		continue;
		}
		} else if (MDT.dominates(MI2, MI1)) {
		if (!intereferes(MI1, MI2)) {
		DEBUG(dbgs() << "Erasing from BB#" << MI1->getParent()->getNumber()
		<< " " << *MI1);
		MI1->eraseFromParent();
		Defs.erase(I1++);
		Changed = true;
		break;
		}
		} else {
		auto *MBB = MDT.findNearestCommonDominator(MI1->getParent(),
		MI2->getParent());
		if (!MBB) {
		++I2;
		continue;
		}

		MachineBasicBlock::iterator I = MBB->getFirstNonPHI();
		if (!intereferes(MI1, I) && !intereferes(MI2, I)) {
		DEBUG(dbgs() << "Erasing from BB#" << MI1->getParent()->getNumber()
		<< " " << *MI1 << "and moving from BB#"
		<< MI2->getParent()->getNumber() << " to BB#"
		<< I->getParent()->getNumber() << " " << *MI2);
		I->getParent()->splice(I, MI2->getParent(), MI2);
		MI1->eraseFromParent();
		Defs.erase(I1++);
		Changed = true;
		break;
		}
		}
		++I2;
		}
		++I1;
		}
		}

		if (Changed)
		MRI.clearKillFlags(Reg);

		return Changed;
		}

bool SIFixSGPRCopies::runOnMachineFunction(MachineFunction &MF) {		bool SIFixSGPRCopies::runOnMachineFunction(MachineFunction &MF) {
const SISubtarget &ST = MF.getSubtarget<SISubtarget>();		const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
MachineRegisterInfo &MRI = MF.getRegInfo();		MachineRegisterInfo &MRI = MF.getRegInfo();
const SIRegisterInfo *TRI = ST.getRegisterInfo();		const SIRegisterInfo *TRI = ST.getRegisterInfo();
const SIInstrInfo *TII = ST.getInstrInfo();		const SIInstrInfo *TII = ST.getInstrInfo();
MDT = &getAnalysis<MachineDominatorTree>();		MDT = &getAnalysis<MachineDominatorTree>();

SmallVector<MachineInstr *, 16> Worklist;		SmallVector<MachineInstr *, 16> Worklist;
[... 116 lines not shown ...]	for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
TII->moveToVALU(MI);		TII->moveToVALU(MI);
}		}
break;		break;
}		}
}		}
}		}
}		}

		if (MF.getTarget().getOptLevel() > CodeGenOpt::None && EnableM0Merge)
		hoistAndMergeSGPRInits(AMDGPU::M0, MRI, *MDT);

return true;		return true;
}		}

llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.cpp

[... 140 lines not shown ...]	BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const {
BitVector Reserved(getNumRegs());		BitVector Reserved(getNumRegs());
Reserved.set(AMDGPU::INDIRECT_BASE_ADDR);		Reserved.set(AMDGPU::INDIRECT_BASE_ADDR);

// EXEC_LO and EXEC_HI could be allocated and used as regular register, but		// EXEC_LO and EXEC_HI could be allocated and used as regular register, but
// this seems likely to result in bugs, so I'm marking them as reserved.		// this seems likely to result in bugs, so I'm marking them as reserved.
reserveRegisterTuples(Reserved, AMDGPU::EXEC);		reserveRegisterTuples(Reserved, AMDGPU::EXEC);
reserveRegisterTuples(Reserved, AMDGPU::FLAT_SCR);		reserveRegisterTuples(Reserved, AMDGPU::FLAT_SCR);

		// M0 has to be reserved so that llvm accepts it as a live-in into a block.
		reserveRegisterTuples(Reserved, AMDGPU::M0);

// Reserve the memory aperture registers.		// Reserve the memory aperture registers.
reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_BASE);		reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_BASE);
reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_LIMIT);		reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_LIMIT);
reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_BASE);		reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_BASE);
reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_LIMIT);		reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_LIMIT);

// Reserve Trap Handler registers - support is not implemented in Codegen.		// Reserve Trap Handler registers - support is not implemented in Codegen.
reserveRegisterTuples(Reserved, AMDGPU::TBA);		reserveRegisterTuples(Reserved, AMDGPU::TBA);
[... 1,255 lines not shown ...]

llvm/trunk/test/CodeGen/AMDGPU/merge-m0.mir

				# RUN: llc -march=amdgcn -amdgpu-enable-merge-m0 -verify-machineinstrs -run-pass si-fix-sgpr-copies %s -o - | FileCheck -check-prefix=GCN %s

				# GCN: bb.0.entry:
				# GCN: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 65536
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 65536
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.1:
				# GCN: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.2:
				# GCN: SI_INIT_M0 65536
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.3:
				# GCN: SI_INIT_M0 3

				# GCN: bb.4:
				# GCN-NOT: SI_INIT_M0
				# GCN: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 4
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.5:
				# GCN-NOT: SI_INIT_M0
				# GCN: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 4
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.6:
				# GCN: SI_INIT_M0 -1,
				# GCN-NEXT: DS_WRITE_B32
				# GCN: SI_INIT_M0 %2
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 %2
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32

				---
				name: test
				alignment: 0
				exposesReturnsTwice: false
				noVRegs: false
				legalized: false
				regBankSelected: false
				selected: false
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32 }
				- { id: 1, class: vgpr_32 }
				- { id: 2, class: sreg_32_xm0 }
				body: |
				bb.0.entry:
				successors: %bb.1, %bb.2

				%0 = IMPLICIT_DEF
				%1 = IMPLICIT_DEF
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_CBRANCH_VCCZ %bb.1, implicit undef %vcc
				S_BRANCH %bb.2

				bb.1:
				successors: %bb.2
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.2

				bb.2:
				successors: %bb.3
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.3

				bb.3:
				successors: %bb.4, %bb.5
				S_CBRANCH_VCCZ %bb.4, implicit undef %vcc
				S_BRANCH %bb.5

				bb.4:
				successors: %bb.6
				SI_INIT_M0 3, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 4, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.6

				bb.5:
				successors: %bb.6
				SI_INIT_M0 3, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 4, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.6

				bb.6:
				successors: %bb.0.entry, %bb.6
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				%2 = IMPLICIT_DEF
				SI_INIT_M0 %2, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 %2, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_CBRANCH_VCCZ %bb.6, implicit undef %vcc
				S_BRANCH %bb.0.entry

				...

llvm/trunk/test/CodeGen/AMDGPU/spill-m0.ll

	[... 63 lines not shown ...]
	; m0 is killed, so it isn't necessary during the entry block spill to preserve it			; m0 is killed, so it isn't necessary during the entry block spill to preserve it
	; GCN-LABEL: {{^}}spill_kill_m0_lds:			; GCN-LABEL: {{^}}spill_kill_m0_lds:
	; GCN: s_mov_b32 m0, s6			; GCN: s_mov_b32 m0, s6
	; GCN: v_interp_mov_f32			; GCN: v_interp_mov_f32

	; TOSMEM-NOT: s_m0			; TOSMEM-NOT: s_m0
	; TOSMEM: s_add_u32 m0, s7, 0x100			; TOSMEM: s_add_u32 m0, s7, 0x100
	; TOSMEM-NEXT: s_buffer_store_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 4-byte Folded Spill			; TOSMEM-NEXT: s_buffer_store_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 4-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME: RegScavenger::isRegUsed() always returns true if m0 is reserved, so we have to save and restore it
				; FIXME-TOSMEM-NOT: m0

	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s7, 0x200			; TOSMEM: s_add_u32 m0, s7, 0x200
	; TOSMEM: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill			; TOSMEM: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0

	; TOSMEM: s_mov_b64 exec,			; TOSMEM: s_mov_b64 exec,
	; TOSMEM: s_cbranch_execz			; TOSMEM: s_cbranch_execz
	; TOSMEM: s_branch			; TOSMEM: s_branch

	; TOSMEM: BB{{[0-9]+_[0-9]+}}:			; TOSMEM: BB{{[0-9]+_[0-9]+}}:
	; TOSMEM-NEXT: s_add_u32 m0, s7, 0x200			; TOSMEM: s_add_u32 m0, s7, 0x200
	; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload			; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload


	; GCN-NOT: v_readlane_b32 m0			; GCN-NOT: v_readlane_b32 m0
	; GCN-NOT: s_buffer_store_dword m0			; GCN-NOT: s_buffer_store_dword m0
	; GCN-NOT: s_buffer_load_dword m0			; GCN-NOT: s_buffer_load_dword m0
	define amdgpu_ps void @spill_kill_m0_lds(<16 x i8> addrspace(2)* inreg %arg, <16 x i8> addrspace(2)* inreg %arg1, <32 x i8> addrspace(2)* inreg %arg2, i32 inreg %m0) #0 {			define amdgpu_ps void @spill_kill_m0_lds(<16 x i8> addrspace(2)* inreg %arg, <16 x i8> addrspace(2)* inreg %arg1, <32 x i8> addrspace(2)* inreg %arg2, i32 inreg %m0) #0 {
	main_body:			main_body:
	[32 lines not shown]
	; TOSMEM-NEXT: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill			; TOSMEM-NEXT: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill
	; TOSMEM: s_mov_b32 m0, vcc_hi			; TOSMEM: s_mov_b32 m0, vcc_hi

	; TOSMEM: s_mov_b64 exec,			; TOSMEM: s_mov_b64 exec,
	; TOSMEM: s_cbranch_execz			; TOSMEM: s_cbranch_execz
	; TOSMEM: s_branch			; TOSMEM: s_branch

	; TOSMEM: BB{{[0-9]+_[0-9]+}}:			; TOSMEM: BB{{[0-9]+_[0-9]+}}:
	; TOSMEM-NEXT: s_add_u32 m0, s3, 0x100			; TOSMEM: s_add_u32 m0, s3, 0x100
	; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload			; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload

	; GCN-NOT: v_readlane_b32 m0			; GCN-NOT: v_readlane_b32 m0
	; GCN-NOT: s_buffer_store_dword m0			; GCN-NOT: s_buffer_store_dword m0
	; GCN-NOT: s_buffer_load_dword m0			; GCN-NOT: s_buffer_load_dword m0
	define amdgpu_kernel void @m0_unavailable_spill(i32 %m0.arg) #0 {			define amdgpu_kernel void @m0_unavailable_spill(i32 %m0.arg) #0 {
	main_body:			main_body:
	%m0 = call i32 asm sideeffect "; def $0, 1", "={M0}"() #0			%m0 = call i32 asm sideeffect "; def $0, 1", "={M0}"() #0
	[12 lines not shown]

	endif:			endif:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}restore_m0_lds:			; GCN-LABEL: {{^}}restore_m0_lds:
	; TOSMEM: s_load_dwordx2 [[REG:s\[[0-9]+:[0-9]+\]]]			; TOSMEM: s_load_dwordx2 [[REG:s\[[0-9]+:[0-9]+\]]]
	; TOSMEM: s_cmp_eq_u32			; TOSMEM: s_cmp_eq_u32
	; TOSMEM-NOT: m0			; FIXME: RegScavenger::isRegUsed() always returns true if m0 is reserved, so we have to save and restore it
				; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s3, 0x100			; TOSMEM: s_add_u32 m0, s3, 0x100
	; TOSMEM: s_buffer_store_dwordx2 [[REG]], s[88:91], m0 ; 8-byte Folded Spill			; TOSMEM: s_buffer_store_dwordx2 [[REG]], s[88:91], m0 ; 8-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s3, 0x300			; TOSMEM: s_add_u32 m0, s3, 0x300
	; TOSMEM: s_buffer_store_dword s{{[0-9]+}}, s[88:91], m0 ; 4-byte Folded Spill			; TOSMEM: s_buffer_store_dword s{{[0-9]+}}, s[88:91], m0 ; 4-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_cbranch_scc1			; TOSMEM: s_cbranch_scc1

	; TOSMEM: s_mov_b32 m0, -1			; TOSMEM: s_mov_b32 m0, -1

	; TOSMEM: s_mov_b32 vcc_hi, m0			; TOSMEM: s_mov_b32 vcc_hi, m0
	; TOSMEM: s_add_u32 m0, s3, 0x100			; TOSMEM: s_add_u32 m0, s3, 0x100
	; TOSMEM: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s[88:91], m0 ; 8-byte Folded Reload			; TOSMEM: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s[88:91], m0 ; 8-byte Folded Reload
	; TOSMEM: s_mov_b32 m0, vcc_hi			; TOSMEM: s_mov_b32 m0, vcc_hi
	; TOSMEM: s_waitcnt lgkmcnt(0)			; TOSMEM: s_waitcnt lgkmcnt(0)

	; TOSMEM: ds_write_b64			; TOSMEM: ds_write_b64

	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s3, 0x300			; TOSMEM: s_add_u32 m0, s3, 0x300
	; TOSMEM: s_buffer_load_dword s0, s[88:91], m0 ; 4-byte Folded Reload			; TOSMEM: s_buffer_load_dword s0, s[88:91], m0 ; 4-byte Folded Reload
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_waitcnt lgkmcnt(0)			; TOSMEM: s_waitcnt lgkmcnt(0)
	; TOSMEM-NOT: m0			; TOSMEM-NOT: m0
	; TOSMEM: s_mov_b32 m0, s0			; TOSMEM: s_mov_b32 m0, s0
	; TOSMEM: ; use m0			; TOSMEM: ; use m0

	; TOSMEM: s_dcache_wb			; TOSMEM: s_dcache_wb
	; TOSMEM: s_endpgm			; TOSMEM: s_endpgm
	define amdgpu_kernel void @restore_m0_lds(i32 %arg) {			define amdgpu_kernel void @restore_m0_lds(i32 %arg) {
	[21 lines not shown]