This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Merge M0 initializations
Closed, Public

Authored by rampitec on Apr 20 2017, 2:20 AM.

Details

Reviewers

vpykhtin
arsenm

Commits

rGbd5394be3d2b: [AMDGPU] Merge M0 initializations
rL301228: [AMDGPU] Merge M0 initializations

Summary

Merges equivalent initializations of M0 and hoists them into a common
dominator block. Technically the same code can be used with any
register, physical or virtual.

It is off by default because it currently causes performance regressions
rather than improvements. The regressions come from the extra freedom the
scheduler gets once M0 is out of its way: the scheduler is notorious for
blowing up register pressure. The change is nevertheless needed to develop
and experiment with a new scheduler, so it is kept under an option until
the new scheduler is ready.
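
To make the hoisting target concrete, here is a minimal sketch of finding the nearest common dominator of two blocks on a toy dominator tree. `ToyDomTree`, `IDom` and `mergePoint` are hypothetical names used only for illustration; the patch itself relies on `MachineDominatorTree::findNearestCommonDominator`.

```cpp
#include <cassert>
#include <vector>

// Toy dominator tree (hypothetical stand-in for MachineDominatorTree).
// IDom[B] is the immediate dominator of block B; the entry block has
// index 0 and is recorded as its own immediate dominator.
struct ToyDomTree {
  std::vector<int> IDom;

  // Depth of a block in the dominator tree (the entry block has depth 0).
  int depth(int B) const {
    int D = 0;
    while (B != 0) {
      B = IDom[B];
      ++D;
    }
    return D;
  }

  // Nearest block that dominates both A and B.
  int nearestCommonDominator(int A, int B) const {
    int DA = depth(A), DB = depth(B);
    while (DA > DB) { A = IDom[A]; --DA; }
    while (DB > DA) { B = IDom[B]; --DB; }
    while (A != B) { A = IDom[A]; B = IDom[B]; }
    return A;
  }
};

// Block into which two identical initializations could be merged,
// ignoring clobbers (which the real pass must also check).
int mergePoint(const ToyDomTree &DT, int BlockA, int BlockB) {
  assert(BlockA >= 0 && BlockB >= 0 && "block indices must be valid");
  return DT.nearestCommonDominator(BlockA, BlockB);
}
```

If one of the two blocks already dominates the other, the merge point is that block, which corresponds to the case where one initialization is simply erased.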

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Apr 20 2017, 2:20 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Apr 20 2017, 2:20 AM

Fixed formatting.

Replaced SmallVector::insert() with SmallVector::append().

Thank you for doing this! I really need it.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
473	I'm confused. Why may a path from Clobber to From clobber From? 'From' is a def, and having a path from Clobber to From clobbers Clobber? :-)
478	this is a xor
503	Maybe it would be better to exhaust Defs by erase only, without push_back/pop_back?
537	Looks like LocalChanged and Changed have the same value; replace them with one flag?

rampitec marked 4 inline comments as done. Apr 20 2017, 9:43 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
473	Here From is the position from which we want to move a def, and To is the position to move it to. Clobber is a potentially clobbering instruction. If the clobber reaches neither From nor To, it is safe to move with respect to that clobber. If it reaches one but not the other, the move is not safe, because the clobber will hide the initialization at either the old or the new position, so a different value reaches the consumers of the def. The remaining question is whether we can move when both are reachable, which is checked later in this lambda. For example, if the clobber is in the entry block and both From and To are well after it, both are reachable, but there is no clobbering in between.
478	We do not have a ^^ operator, so it is either casts or potential warnings.
503	Here we have removed MI2, but MI1 can still be combined with something. We need to repeat the inner loop from the very beginning, thus the push_back. Now we cannot really exhaust Defs and leave it empty. All defs which remain shall be re-added, so that in the next iteration of the outer loop, which processes a different initialization value, they become potential clobbers themselves.
537	LocalChanged basically tells us that something was combined in the inner loop and we can just continue. If nothing was combined, the instruction has to be moved from Defs into Visited. When we exhaust the current Defs, Visited will contain only the remaining values. These need to be moved back into Defs so that the remaining defs act as potential clobbers for other iterations. Changed then accumulates the return value over all iterations.
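
The three cases described for line 473 can be sketched as a small classifier. This is only an illustration under assumptions: `Verdict` and `classifyClobber` are hypothetical names, and the patch itself returns a plain `bool` from a lambda. For `bool` operands, `!=` behaves as a logical XOR, which is one way to express the "exactly one is reachable" case without a `^^` operator.

```cpp
// Hypothetical three-way verdict for a single potential clobber.
enum class Verdict { Safe, Unsafe, NeedsDominanceCheck };

// MayClobberFrom: the clobber can reach the old position of the def.
// MayClobberTo:   the clobber can reach the new position of the def.
Verdict classifyClobber(bool MayClobberFrom, bool MayClobberTo) {
  // The clobber reaches neither position: moving is safe.
  if (!MayClobberFrom && !MayClobberTo)
    return Verdict::Safe;
  // Exactly one position is reachable (logical XOR on bools):
  // the move would change the value seen by consumers.
  if (MayClobberFrom != MayClobberTo)
    return Verdict::Unsafe;
  // Both are reachable: a further dominance check decides.
  return Verdict::NeedsDominanceCheck;
}
```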

vpykhtin added inline comments. Apr 20 2017, 10:38 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
428	If I understood correctly, this loop ensures that the instruction defines only Reg and has one Imm? Maybe it would be clearer and more reliable to check only the instructions we're interested in, like moves?
473	ok, this helps. I misunderstood from and to.
496	Why not combine these checks with the part above, like:

if (MDT.dominates(MI1, MI2)) {
  if (!intereferes(MI2, MI1)) { ... }
} else if (MDT.dominates(MI2, MI1)) {
  if (!intereferes(MI1, MI2)) { ... }
} else {
  auto *MBB = MDT.findNearestCommonDominator(MI1->getParent(), MI2->getParent());
  if (!MBB)
    continue;
  I = MBB->getFirstNonPHI();
  if (!intereferes(MI1, I) && !intereferes(MI2, I)) { ... }
}

and possibly factor out the common code in these parts.
503	I think you can do it like this:
1. Define an iterator OI for the outer loop iterating Defs from begin to end (an old-fashioned for loop).
2. Each time the inner loop deletes something, reset OI to the beginning of Defs.
You would get rid of push/pop and the Visited array.
534	Why do you need to restore Defs? It looks like it isn't used anymore.

rampitec marked 7 inline comments as done. Apr 20 2017, 10:47 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
428	M0 is not defined by mov, it is SI_INIT_M0. Also we need to capture all defs to check for clobbering.
496	I can be neither MI1 nor MI2.
503	I do not want to reset it and reprocess what I already know I cannot combine. Then the inner loop would need to run only on a slice of Defs.
534	That is to use remaining defs as potential clobbers for other iterations.

vpykhtin added inline comments. Apr 20 2017, 11:01 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
496	the logic is the same.
503	Ok, then you don't need to reset to begin each time; you just need to handle deletion of the element pointed to by OI by moving OI to the next position. Use std::list for Defs.
534	Ok, then instead of restoring Defs it's better to copy it before processing into a container better suited to element removal, such as std::list, and use an iterator for removal instead of find.
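
The std::list suggestion can be sketched as follows. This is a minimal illustration, not the patch's code: `mergeInPlace` and `CanCombine` are hypothetical names, and the real pass also erases or moves machine instructions and may remove the outer element. The point shown is that removal through the iterator keeps the loop going without a separate Visited container.

```cpp
#include <iterator>
#include <list>

// Folds every later element into an earlier one whenever CanCombine says
// the pair is mergeable, removing the later element from the list.
// Returns true if anything was removed.
template <typename T, typename Pred>
bool mergeInPlace(std::list<T> &Defs, Pred CanCombine) {
  bool Changed = false;
  for (auto I1 = Defs.begin(); I1 != Defs.end(); ++I1) {
    for (auto I2 = std::next(I1); I2 != Defs.end();) {
      if (CanCombine(*I1, *I2)) {
        // std::list::erase returns the iterator following the removed
        // element and leaves all other iterators (including I1) valid.
        I2 = Defs.erase(I2);
        Changed = true;
      } else {
        ++I2;
      }
    }
  }
  return Changed;
}
```

Elements that could not be combined simply stay in the list, so they remain available as potential clobbers for later iterations.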

I think this is more complicated than it needs to be and is reinventing most of the logic for a generic hoisting pass. We already know the value of m0 at the important uses. OpenGL might also want to initialize it once in the prolog from a register, and the other uses of m0 are less frequent.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
398–400	This shouldn't be needed. m0 would be the only possible live in physreg at this point, and you reserved it
lib/Target/AMDGPU/SIRegisterInfo.cpp
150	My patch specifically avoided doing this. I don't think we want it to be reserved, because this kills all generic copy optimizations.

In D32279#732500, @arsenm wrote:

I think this is more complicated than it needs to be and is reinventing most of the logic for a generic hoisting pass. We already know the value of m0 at the important uses. OpenGL might also want to initialize it once in the prolog from a register, and the other uses of m0 are less frequent.

M0 uses will become more frequent when we implement movrel scratch promotion. Then even without, it is either a single init per function, which is less than needed, or something like this.

rampitec added inline comments. Apr 20 2017, 1:16 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

398–400

This might not be needed since it is now reserved, but if I want to restore SpillToSMEM I would need live-ins. I am thinking of replacing RS->isRegUsed() with:

static bool isRegUsed(unsigned Reg, MachineBasicBlock::iterator MI,
                      const SIRegisterInfo *TRI, RegScavenger *RS) {
  const MachineBasicBlock *MBB = MI->getParent();
  const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();

  if (!MRI.isReserved(Reg))
    return RS->isRegUsed(Reg);

  bool Defined = MBB->isLiveIn(Reg);
  bool Found = false;

  for (auto &I : *MBB) {
    if (I == MI) {
      if (!Defined)
        return false;
      Found = true;
    }
    if (Found && I.readsRegister(Reg, TRI))
      return true;
    if (I.modifiesRegister(Reg, TRI)) {
      if (Found)
        return false;
      Defined = true;
    } else if (I.killsRegister(Reg, TRI)) {
      if (Found)
        return false;
      Defined = false;
    }
  }

  return Defined;
}

lib/Target/AMDGPU/SIRegisterInfo.cpp

150

Here is the problem: when I hoist an init out of the use block, I need to add a live-in to satisfy live variable analysis and then the verifier after it. Now if I have live-ins, I hit this: "MBB has allocatable live-in, but isn't entry or landing-pad." So I have to make it reserved then.

Do you have better ideas?

D30227 has been waiting for review for a long time

In D32279#732555, @arsenm wrote:

D30227 has been waiting for review for a long time

You have no reviewers, so I guess that is why ;)
Seriously, this looks good, although there can be less than efficient side effects if we use M0 for something else besides LDS.
Then another problem, even in this change I have to disable it by default due to problems with register pressure after scheduler. It can only be enabled when scheduler can deal with pressure better, but at the same time we need it to write that scheduler. I do not see an easy way to disable D30227.

Combined checks in the inner loop.

In D32279#732558, @rampitec wrote:

You have no reviewers, so I guess that is why ;)
Seriously, this looks good, although there can be less than efficient side effects if we use M0 for something else besides LDS.
Then another problem, even in this change I have to disable it by default due to problems with register pressure after scheduler. It can only be enabled when scheduler can deal with pressure better, but at the same time we need it to write that scheduler. I do not see an easy way to disable D30227.

It shouldn't be necessary to disable. Another factor is being able to eliminate the initializations on gfx9 for lds

The other patch may not have the same scheduler problem impact

In D32279#732566, @arsenm wrote:

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

In D32279#732578, @rampitec wrote:

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

In D32279#732613, @arsenm wrote:

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

In D32279#732614, @rampitec wrote:

I understand, yes. Did you try to run a performance check recently with this?

Switched to list from vector.
Removed declspec which breaks MSVC.

rampitec marked 5 inline comments as done. Apr 20 2017, 3:17 PM

In D32279#732624, @arsenm wrote:

In D32279#732614, @rampitec wrote:

I understand, yes. Did you try to run a performance check recently with this?

No

The patch really seems old; there were too many conflicts when I tried to apply it.

Removed live-in logic since M0 ends up reserved anyway.

Looks very good now! Thanks!

This revision is now accepted and ready to land. Apr 21 2017, 4:10 AM

Removed temp iterator copies before erase() since there is a sequence point before the actual call.
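
A minimal sketch of the idiom this update refers to, assuming a node-based container such as std::list (`eraseAll` is a hypothetical helper, not part of the patch). The post-increment is fully evaluated before the body of erase() runs, so the loop iterator has already moved past the element being removed.

```cpp
#include <list>

// Erases every element equal to Value and returns how many were removed.
int eraseAll(std::list<int> &L, int Value) {
  int Erased = 0;
  for (auto I = L.begin(); I != L.end();) {
    if (*I == Value) {
      // I++ advances I and passes the old position as the argument.
      // Argument evaluation is sequenced before the call, and
      // std::list::erase invalidates only the erased element's iterator.
      L.erase(I++);
      ++Erased;
    } else {
      ++I;
    }
  }
  return Erased;
}
```

This relies on std::list semantics; for contiguous containers such as std::vector, erase invalidates later iterators as well, and the value returned by erase should be used instead.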

arsenm added inline comments. Apr 21 2017, 12:27 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
391	typo diffeernt
409	list is weird. vector or SmallVector?

Fixed typo in comment.

Reverted back to SmallVector for clobbers, it does not need to be list.

rampitec added inline comments. Apr 21 2017, 4:46 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
150	Actually D30227 should have the same issues with phys live-ins (and it has them according to its description), so it shall also fail verification unless M0 is reserved. The two patches are not different here: the problem is not in the approach, but in the very fact of a physreg live-in to a block. I can remove reserveRegisterTuples and restore live-in generation, and the result will be the same as in D30227: it will work, but will not pass verification. Reserving M0 allows it to pass.

arsenm added inline comments. Apr 21 2017, 4:51 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
150	I think the point of mostly handling it in the DAG was that the generic InstrEmitter handled figuring out where live-in phys regs were absolutely necessary. I also went through a few iterations where m0 wasn't live-in, but there was a copy from the one initialized register in the prolog.

rampitec added inline comments. Apr 21 2017, 4:54 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
150	The code I have removed did minimal live-in attribution (i.e. where absolutely necessary). That still does not change the fact that a phys live-in is only expected in entry and ehpad blocks.

LGTM

Closed by commit rL301228: [AMDGPU] Merge M0 initializations (authored by rampitec). Apr 24 2017, 12:50 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                     Size
lib/Target/AMDGPU/SIFixSGPRCopies.cpp    207 lines
lib/Target/AMDGPU/SIRegisterInfo.cpp     3 lines
test/CodeGen/AMDGPU/merge-m0.mir         132 lines
test/CodeGen/AMDGPU/spill-m0.ll          22 lines

Diff 96037

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

... 75 lines not shown ...
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "si-fix-sgpr-copies"		#define DEBUG_TYPE "si-fix-sgpr-copies"

		static cl::opt<bool> EnableM0Merge(
		"amdgpu-enable-merge-m0",
		cl::desc("Merge and hoist M0 initializations"),
		cl::init(false));

namespace {		namespace {

class SIFixSGPRCopies : public MachineFunctionPass {		class SIFixSGPRCopies : public MachineFunctionPass {

MachineDominatorTree *MDT;		MachineDominatorTree *MDT;

public:		public:
static char ID;		static char ID;
... 11 lines not shown ...	void getAnalysisUsage(AnalysisUsage &AU) const override {
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}
};		};

} // End anonymous namespace		} // End anonymous namespace

INITIALIZE_PASS_BEGIN(SIFixSGPRCopies, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(SIFixSGPRCopies, DEBUG_TYPE,
"SI Fix SGPR copies", false, false)		"SI Fix SGPR copies", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)		INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
INITIALIZE_PASS_END(SIFixSGPRCopies, DEBUG_TYPE,		INITIALIZE_PASS_END(SIFixSGPRCopies, DEBUG_TYPE,
"SI Fix SGPR copies", false, false)		"SI Fix SGPR copies", false, false)


char SIFixSGPRCopies::ID = 0;		char SIFixSGPRCopies::ID = 0;

char &llvm::SIFixSGPRCopiesID = SIFixSGPRCopies::ID;		char &llvm::SIFixSGPRCopiesID = SIFixSGPRCopies::ID;

... 207 lines not shown ...	static bool isSafeToFoldImmIntoCopy(const MachineInstr *Copy,
case AMDGPU::V_MOV_B64_PSEUDO:		case AMDGPU::V_MOV_B64_PSEUDO:
SMovOp = AMDGPU::S_MOV_B64;		SMovOp = AMDGPU::S_MOV_B64;
break;		break;
}		}
Imm = ImmOp->getImm();		Imm = ImmOp->getImm();
return true;		return true;
}		}

static bool predsHasDivergentTerminator(MachineBasicBlock *MBB,		template <class UnaryPredicate>
const TargetRegisterInfo *TRI) {		bool searchPredecessors(const MachineBasicBlock *MBB,
DenseSet<MachineBasicBlock*> Visited;		const MachineBasicBlock *CutOff,
		UnaryPredicate Predicate) {

		if (MBB == CutOff)
		return false;

		DenseSet<const MachineBasicBlock*> Visited;
SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(),		SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(),
MBB->pred_end());		MBB->pred_end());

while (!Worklist.empty()) {		while (!Worklist.empty()) {
MachineBasicBlock *mbb = Worklist.back();		MachineBasicBlock *MBB = Worklist.pop_back_val();
Worklist.pop_back();

if (!Visited.insert(mbb).second)		if (!Visited.insert(MBB).second)
continue;		continue;
if (hasTerminatorThatModifiesExec(mbb, TRI))		if (MBB == CutOff)
		continue;
		if (Predicate(MBB))
return true;		return true;

Worklist.insert(Worklist.end(), mbb->pred_begin(), mbb->pred_end());		Worklist.append(MBB->pred_begin(), MBB->pred_end());
}		}

return false;		return false;
}		}

		static bool predsHasDivergentTerminator(MachineBasicBlock *MBB,
		const TargetRegisterInfo *TRI) {
		return searchPredecessors(MBB, nullptr, [TRI](MachineBasicBlock *MBB) {
		return hasTerminatorThatModifiesExec(MBB, TRI); });
		}

		// Checks if there is potential path From instruction To instruction.
		// If CutOff is specified and it sits in between of that path we ignore
		// a higher portion of the path and report it is not reachable.
		static bool isReachable(const MachineInstr *From,
		const MachineInstr *To,
		const MachineBasicBlock *CutOff,
		MachineDominatorTree &MDT) {
		// If either From block dominates To block or instructions are in the same
		// block and From is higher.
		if (MDT.dominates(From, To))
		return true;

		const MachineBasicBlock *MBBFrom = From->getParent();
		const MachineBasicBlock *MBBTo = To->getParent();
		if (MBBFrom == MBBTo)
		return false;

		// Instructions are in diffeernt blocks, do predecessor search.
		// We should almost never get here since we do not usually produce M0 stores
		// other than -1.
		return searchPredecessors(MBBTo, CutOff, [MBBFrom]
		(const MachineBasicBlock *MBB) { return MBB == MBBFrom; });
		}

		// Recursively add live-in to a BB and its predecessors until Root.
		static void addLiveIn(MachineBasicBlock *MBB, const MachineBasicBlock *Root,
		unsigned Reg) {
		if (TargetRegisterInfo::isVirtualRegister(Reg))
		return;

		if (!MBB->isLiveIn(Reg))
		MBB->addLiveIn(Reg);
		searchPredecessors(MBB, Root, [Reg] (MachineBasicBlock *MBB) {
		if (!MBB->isLiveIn(Reg))
		MBB->addLiveIn(Reg);
		return false;
		});
		}

		// Hoist and merge identical SGPR initializations into a common predecessor.
		// This is intended to combine M0 initializations, but can work with any
		// SGPR. A VGPR cannot be processed since we cannot guarantee vector
		// execution.
		static bool hoistAndMergeSGPRInits(unsigned Reg,
		const MachineRegisterInfo &MRI,
		MachineDominatorTree &MDT) {
		// List of inits by immediate value.
		typedef std::map<unsigned, std::list<MachineInstr*>> InitListMap;
		InitListMap Inits;
		// List of clobbering instructions.
		std::list<MachineInstr*> Clobbers;
		bool Changed = false;

		for (auto &MI : MRI.def_instructions(Reg)) {
		MachineOperand *Imm = nullptr;
		for (auto &MO: MI.operands()) {
		if ((MO.isReg() && ((MO.isDef() && MO.getReg() != Reg) || !MO.isDef())) ||
		(!MO.isImm() && !MO.isReg()) || (MO.isImm() && Imm)) {
		Imm = nullptr;
		break;
		} else if (MO.isImm())
		Imm = &MO;
		}
		if (Imm)
		Inits[Imm->getImm()].push_front(&MI);
		else
		Clobbers.push_front(&MI);
		}

		for (auto &Init : Inits) {
		auto &Defs = Init.second;

		for (auto I1 = Defs.begin(), E = Defs.end(); I1 != E; ) {
		MachineInstr *MI1 = *I1;

		for (auto I2 = std::next(I1); I2 != E; ) {
		MachineInstr *MI2 = *I2;

		// Check any possible interference
		auto intereferes = [&](MachineBasicBlock::iterator From,
		MachineBasicBlock::iterator To) -> bool {

		assert(MDT.dominates(&*To, &*From));

		auto interferes = [&MDT, From, To](MachineInstr* &Clobber) -> bool {
		const MachineBasicBlock *MBBFrom = From->getParent();
		const MachineBasicBlock *MBBTo = To->getParent();
		bool MayClobberFrom = isReachable(Clobber, &*From, MBBTo, MDT);
		bool MayClobberTo = isReachable(Clobber, &*To, MBBTo, MDT);
		if (!MayClobberFrom && !MayClobberTo)
		return false;
		if ((MayClobberFrom && !MayClobberTo) ||
		(!MayClobberFrom && MayClobberTo))
		return true;
		// Both can clobber, this is not an interference only if both are
		// dominated by Clobber and belong to the same block or if Clobber
		// properly dominates To, given that To >> From, so it dominates
		// both and located in a common dominator.
		return !((MBBFrom == MBBTo &&
		MDT.dominates(Clobber, &*From) &&
		MDT.dominates(Clobber, &*To)) ||
		MDT.properlyDominates(Clobber->getParent(), MBBTo));
		};

		return (any_of(Clobbers, interferes)) ||
		(any_of(Inits, [&](InitListMap::value_type &C) {
		return C.first != Init.first && any_of(C.second, interferes);
		}));
		};

		if (MDT.dominates(MI1, MI2)) {
		if (!intereferes(MI2, MI1)) {
		DEBUG(dbgs() << "Erasing from BB#" << MI2->getParent()->getNumber()
		<< " " << *MI2);
		addLiveIn(MI2->getParent(), MI1->getParent(), Reg);
		MI2->eraseFromParent();
		auto Rem = I2++;
		Defs.erase(Rem);
		Changed = true;
		continue;
		}
		} else if (MDT.dominates(MI2, MI1)) {
		if (!intereferes(MI1, MI2)) {
		DEBUG(dbgs() << "Erasing from BB#" << MI1->getParent()->getNumber()
		<< " " << *MI1);
		addLiveIn(MI1->getParent(), MI2->getParent(), Reg);
		MI1->eraseFromParent();
		auto Rem = I1++;
		Defs.erase(Rem);
		Changed = true;
		break;
		}
		} else {
		auto *MBB = MDT.findNearestCommonDominator(MI1->getParent(),
		MI2->getParent());
		if (!MBB) {
		++I2;
		continue;
		}

		MachineBasicBlock::iterator I = MBB->getFirstNonPHI();
		if (!intereferes(MI1, I) && !intereferes(MI2, I)) {
		DEBUG(dbgs() << "Erasing from BB#" << MI1->getParent()->getNumber()
		<< " " << *MI1 << "and moving from BB#"
		<< MI2->getParent()->getNumber() << " to BB#"
		<< I->getParent()->getNumber() << " " << *MI2);
		addLiveIn(MI2->getParent(), I->getParent(), Reg);
		I->getParent()->splice(I, MI2->getParent(), MI2);
		addLiveIn(MI1->getParent(), I->getParent(), Reg);
		MI1->eraseFromParent();
		auto Rem = I1++;
		Defs.erase(Rem);
		Changed = true;
		break;
		}
		}
		++I2;
		}
		++I1;
		}
		}

		if (Changed)
		MRI.clearKillFlags(Reg);

		return Changed;
		}

bool SIFixSGPRCopies::runOnMachineFunction(MachineFunction &MF) {		bool SIFixSGPRCopies::runOnMachineFunction(MachineFunction &MF) {
const SISubtarget &ST = MF.getSubtarget<SISubtarget>();		const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
MachineRegisterInfo &MRI = MF.getRegInfo();		MachineRegisterInfo &MRI = MF.getRegInfo();
const SIRegisterInfo *TRI = ST.getRegisterInfo();		const SIRegisterInfo *TRI = ST.getRegisterInfo();
const SIInstrInfo *TII = ST.getInstrInfo();		const SIInstrInfo *TII = ST.getInstrInfo();
MDT = &getAnalysis<MachineDominatorTree>();		MDT = &getAnalysis<MachineDominatorTree>();

SmallVector<MachineInstr *, 16> Worklist;		SmallVector<MachineInstr *, 16> Worklist;
[116 lines not shown]	for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
TII->moveToVALU(MI);		TII->moveToVALU(MI);
}		}
break;		break;
}		}
}		}
}		}
}		}

		if (MF.getTarget().getOptLevel() > CodeGenOpt::None && EnableM0Merge)
		hoistAndMergeSGPRInits(AMDGPU::M0, MRI, *MDT);

return true;		return true;
}		}

lib/Target/AMDGPU/SIRegisterInfo.cpp

[140 lines not shown]	BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const {
BitVector Reserved(getNumRegs());		BitVector Reserved(getNumRegs());
Reserved.set(AMDGPU::INDIRECT_BASE_ADDR);		Reserved.set(AMDGPU::INDIRECT_BASE_ADDR);

// EXEC_LO and EXEC_HI could be allocated and used as regular register, but		// EXEC_LO and EXEC_HI could be allocated and used as regular register, but
// this seems likely to result in bugs, so I'm marking them as reserved.		// this seems likely to result in bugs, so I'm marking them as reserved.
reserveRegisterTuples(Reserved, AMDGPU::EXEC);		reserveRegisterTuples(Reserved, AMDGPU::EXEC);
reserveRegisterTuples(Reserved, AMDGPU::FLAT_SCR);		reserveRegisterTuples(Reserved, AMDGPU::FLAT_SCR);

		// M0 has to be reserved so that llvm accepts it as a live-in into a block.
		reserveRegisterTuples(Reserved, AMDGPU::M0);
		arsenm: My patch specifically avoided doing this. I don't think we want it to be reserved, because this kills all generic copy optimizations.
		rampitec: Here is the problem: when I hoist an init out of the use block, I need to add a live-in to satisfy live variable analysis and then the verifier after it. Once I have live-ins, I hit this: "MBB has allocatable live-in, but isn't entry or landing-pad." So I have to make it reserved. Do you have better ideas?
		rampitec: Actually, D30227 should have the same issue with physreg live-ins (and it does have them, according to its description), so it should also fail verification unless M0 is reserved. The two patches do not differ here; the problem is not in the approach but in the very fact of a physreg live-in to a block. I can remove reserveRegisterTuples and restore live-in generation, and the result will be the same as in D30227: it will work but will not pass verification. Reserving M0 allows it to pass.
		arsenm: I think the point of mostly handling this in the DAG was that the generic InstrEmitter figured out where live-in phys regs were absolutely necessary. I also went through a few iterations where m0 wasn't live-in, but there was a copy from the one initialized register in the prolog.
		rampitec: The code I removed did minimal live-in attribution (i.e. only where absolutely necessary). That still does not change the fact that a phys live-in is only expected in entry and EH pad blocks.
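The verifier constraint discussed in this thread can be modeled with a short sketch. This is a simplified toy, not LLVM's MachineVerifier: the `Block` struct and `liveInsOk` are hypothetical names, registers are plain strings, and "not reserved" is used as a stand-in for "allocatable" (the real notion also depends on register classes).

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, minimal description of a basic block for this sketch.
struct Block {
  bool IsEntry = false;
  bool IsEHPad = false;
  std::vector<std::string> LiveIns; // physical register names
};

// Toy version of the rule quoted above: a live-in that is not reserved
// is accepted only in the entry block or in a landing pad.
bool liveInsOk(const Block &B, const std::set<std::string> &Reserved) {
  if (B.IsEntry || B.IsEHPad)
    return true;
  for (const std::string &Reg : B.LiveIns)
    if (Reserved.count(Reg) == 0)
      return false; // allocatable live-in in an ordinary block
  return true;
}
```

Under this model, an ordinary block with `m0` as a live-in is rejected until `m0` is added to the reserved set, which mirrors why the patch reserves M0.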

// Reserve the memory aperture registers.		// Reserve the memory aperture registers.
reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_BASE);		reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_BASE);
reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_LIMIT);		reserveRegisterTuples(Reserved, AMDGPU::SRC_SHARED_LIMIT);
reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_BASE);		reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_BASE);
reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_LIMIT);		reserveRegisterTuples(Reserved, AMDGPU::SRC_PRIVATE_LIMIT);

// Reserve Trap Handler registers - support is not implemented in Codegen.		// Reserve Trap Handler registers - support is not implemented in Codegen.
reserveRegisterTuples(Reserved, AMDGPU::TBA);		reserveRegisterTuples(Reserved, AMDGPU::TBA);
[1,252 lines not shown]

test/CodeGen/AMDGPU/merge-m0.mir

This file was added.

				# RUN: llc -march=amdgcn -amdgpu-enable-merge-m0 -verify-machineinstrs -run-pass si-fix-sgpr-copies %s -o - | FileCheck -check-prefix=GCN %s

				# GCN: bb.0.entry:
				# GCN: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 65536
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 65536
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.1:
				# GCN: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.2:
				# GCN: SI_INIT_M0 65536
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.3:
				# GCN: SI_INIT_M0 3

				# GCN: bb.4:
				# GCN-NOT: SI_INIT_M0
				# GCN: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 4
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.5:
				# GCN-NOT: SI_INIT_M0
				# GCN: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 4
				# GCN-NEXT: DS_WRITE_B32

				# GCN: bb.6:
				# GCN: SI_INIT_M0 -1,
				# GCN-NEXT: DS_WRITE_B32
				# GCN: SI_INIT_M0 %2
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 %2
				# GCN-NEXT: DS_WRITE_B32
				# GCN-NEXT: SI_INIT_M0 -1
				# GCN-NEXT: DS_WRITE_B32

				---
				name: test
				alignment: 0
				exposesReturnsTwice: false
				noVRegs: false
				legalized: false
				regBankSelected: false
				selected: false
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32 }
				- { id: 1, class: vgpr_32 }
				- { id: 2, class: sreg_32_xm0 }
				body: |
				bb.0.entry:
				successors: %bb.1, %bb.2

				%0 = IMPLICIT_DEF
				%1 = IMPLICIT_DEF
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_CBRANCH_VCCZ %bb.1, implicit undef %vcc
				S_BRANCH %bb.2

				bb.1:
				successors: %bb.2
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.2

				bb.2:
				successors: %bb.3
				SI_INIT_M0 65536, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.3

				bb.3:
				successors: %bb.4, %bb.5
				S_CBRANCH_VCCZ %bb.4, implicit undef %vcc
				S_BRANCH %bb.5

				bb.4:
				successors: %bb.6
				SI_INIT_M0 3, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 4, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.6

				bb.5:
				successors: %bb.6
				SI_INIT_M0 3, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 4, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_BRANCH %bb.6

				bb.6:
				successors: %bb.0.entry, %bb.6
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				%2 = IMPLICIT_DEF
				SI_INIT_M0 %2, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 %2, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				SI_INIT_M0 -1, implicit-def %m0
				DS_WRITE_B32 %0, %1, 0, 0, implicit %m0, implicit %exec
				S_CBRANCH_VCCZ %bb.6, implicit undef %vcc
				S_BRANCH %bb.0.entry

				...
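The within-block part of what this test checks can be reproduced with a small toy model. It is only a sketch under stated assumptions: each integer stands for the immediate of one SI_INIT_M0, the DS_WRITE_B32 uses in between are omitted, nothing else clobbers M0, and cross-block hoisting (as expected for bb.3 through bb.5) is not modeled. `dropRedundantInits` is a hypothetical name.

```cpp
#include <vector>

// Toy model of one basic block: keeps an init only when it changes the
// value that the previous kept init left in M0.
std::vector<int> dropRedundantInits(const std::vector<int> &Inits) {
  std::vector<int> Kept;
  for (int V : Inits) {
    if (!Kept.empty() && Kept.back() == V)
      continue; // M0 already holds V, so this init is redundant
    Kept.push_back(V);
  }
  return Kept;
}
```

Applied to the inits of bb.0.entry in the input (-1, -1, 65536, 65536, -1, 65536), the model keeps -1, 65536, -1, 65536, which is the order of SI_INIT_M0 lines in the GCN checks for that block.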

test/CodeGen/AMDGPU/spill-m0.ll

	[63 lines not shown]
	; m0 is killed, so it isn't necessary during the entry block spill to preserve it			; m0 is killed, so it isn't necessary during the entry block spill to preserve it
	; GCN-LABEL: {{^}}spill_kill_m0_lds:			; GCN-LABEL: {{^}}spill_kill_m0_lds:
	; GCN: s_mov_b32 m0, s6			; GCN: s_mov_b32 m0, s6
	; GCN: v_interp_mov_f32			; GCN: v_interp_mov_f32

	; TOSMEM-NOT: s_m0			; TOSMEM-NOT: s_m0
	; TOSMEM: s_add_u32 m0, s7, 0x100			; TOSMEM: s_add_u32 m0, s7, 0x100
	; TOSMEM-NEXT: s_buffer_store_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 4-byte Folded Spill			; TOSMEM-NEXT: s_buffer_store_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 4-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME: RegScavenger::isRegUsed() always returns true if m0 is reserved, so we have to save and restore it
				; FIXME-TOSMEM-NOT: m0

	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s7, 0x200			; TOSMEM: s_add_u32 m0, s7, 0x200
	; TOSMEM: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill			; TOSMEM: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0

	; TOSMEM: s_mov_b64 exec,			; TOSMEM: s_mov_b64 exec,
	; TOSMEM: s_cbranch_execz			; TOSMEM: s_cbranch_execz
	; TOSMEM: s_branch			; TOSMEM: s_branch

	; TOSMEM: BB{{[0-9]+_[0-9]+}}:			; TOSMEM: BB{{[0-9]+_[0-9]+}}:
	; TOSMEM-NEXT: s_add_u32 m0, s7, 0x200			; TOSMEM: s_add_u32 m0, s7, 0x200
	; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload			; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload


	; GCN-NOT: v_readlane_b32 m0			; GCN-NOT: v_readlane_b32 m0
	; GCN-NOT: s_buffer_store_dword m0			; GCN-NOT: s_buffer_store_dword m0
	; GCN-NOT: s_buffer_load_dword m0			; GCN-NOT: s_buffer_load_dword m0
	define amdgpu_ps void @spill_kill_m0_lds(<16 x i8> addrspace(2)* inreg %arg, <16 x i8> addrspace(2)* inreg %arg1, <32 x i8> addrspace(2)* inreg %arg2, i32 inreg %m0) #0 {			define amdgpu_ps void @spill_kill_m0_lds(<16 x i8> addrspace(2)* inreg %arg, <16 x i8> addrspace(2)* inreg %arg1, <32 x i8> addrspace(2)* inreg %arg2, i32 inreg %m0) #0 {
	main_body:			main_body:
	[32 lines not shown]
	; TOSMEM-NEXT: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill			; TOSMEM-NEXT: s_buffer_store_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Spill
	; TOSMEM: s_mov_b32 m0, vcc_hi			; TOSMEM: s_mov_b32 m0, vcc_hi

	; TOSMEM: s_mov_b64 exec,			; TOSMEM: s_mov_b64 exec,
	; TOSMEM: s_cbranch_execz			; TOSMEM: s_cbranch_execz
	; TOSMEM: s_branch			; TOSMEM: s_branch

	; TOSMEM: BB{{[0-9]+_[0-9]+}}:			; TOSMEM: BB{{[0-9]+_[0-9]+}}:
	; TOSMEM-NEXT: s_add_u32 m0, s3, 0x100			; TOSMEM: s_add_u32 m0, s3, 0x100
	; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload			; TOSMEM-NEXT: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, m0 ; 8-byte Folded Reload

	; GCN-NOT: v_readlane_b32 m0			; GCN-NOT: v_readlane_b32 m0
	; GCN-NOT: s_buffer_store_dword m0			; GCN-NOT: s_buffer_store_dword m0
	; GCN-NOT: s_buffer_load_dword m0			; GCN-NOT: s_buffer_load_dword m0
	define amdgpu_kernel void @m0_unavailable_spill(i32 %m0.arg) #0 {			define amdgpu_kernel void @m0_unavailable_spill(i32 %m0.arg) #0 {
	main_body:			main_body:
	%m0 = call i32 asm sideeffect "; def $0, 1", "={M0}"() #0			%m0 = call i32 asm sideeffect "; def $0, 1", "={M0}"() #0
	[12 lines not shown]

	endif:			endif:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}restore_m0_lds:			; GCN-LABEL: {{^}}restore_m0_lds:
	; TOSMEM: s_load_dwordx2 [[REG:s\[[0-9]+:[0-9]+\]]]			; TOSMEM: s_load_dwordx2 [[REG:s\[[0-9]+:[0-9]+\]]]
	; TOSMEM: s_cmp_eq_u32			; TOSMEM: s_cmp_eq_u32
	; TOSMEM-NOT: m0			; FIXME: RegScavenger::isRegUsed() always returns true if m0 is reserved, so we have to save and restore it
				; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s3, 0x100			; TOSMEM: s_add_u32 m0, s3, 0x100
	; TOSMEM: s_buffer_store_dwordx2 [[REG]], s[88:91], m0 ; 8-byte Folded Spill			; TOSMEM: s_buffer_store_dwordx2 [[REG]], s[88:91], m0 ; 8-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s3, 0x300			; TOSMEM: s_add_u32 m0, s3, 0x300
	; TOSMEM: s_buffer_store_dword s{{[0-9]+}}, s[88:91], m0 ; 4-byte Folded Spill			; TOSMEM: s_buffer_store_dword s{{[0-9]+}}, s[88:91], m0 ; 4-byte Folded Spill
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_cbranch_scc1			; TOSMEM: s_cbranch_scc1

	; TOSMEM: s_mov_b32 m0, -1			; TOSMEM: s_mov_b32 m0, -1

	; TOSMEM: s_mov_b32 vcc_hi, m0			; TOSMEM: s_mov_b32 vcc_hi, m0
	; TOSMEM: s_add_u32 m0, s3, 0x100			; TOSMEM: s_add_u32 m0, s3, 0x100
	; TOSMEM: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s[88:91], m0 ; 8-byte Folded Reload			; TOSMEM: s_buffer_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s[88:91], m0 ; 8-byte Folded Reload
	; TOSMEM: s_mov_b32 m0, vcc_hi			; TOSMEM: s_mov_b32 m0, vcc_hi
	; TOSMEM: s_waitcnt lgkmcnt(0)			; TOSMEM: s_waitcnt lgkmcnt(0)

	; TOSMEM: ds_write_b64			; TOSMEM: ds_write_b64

	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_add_u32 m0, s3, 0x300			; TOSMEM: s_add_u32 m0, s3, 0x300
	; TOSMEM: s_buffer_load_dword s0, s[88:91], m0 ; 4-byte Folded Reload			; TOSMEM: s_buffer_load_dword s0, s[88:91], m0 ; 4-byte Folded Reload
	; TOSMEM-NOT: m0			; FIXME-TOSMEM-NOT: m0
	; TOSMEM: s_waitcnt lgkmcnt(0)			; TOSMEM: s_waitcnt lgkmcnt(0)
	; TOSMEM-NOT: m0			; TOSMEM-NOT: m0
	; TOSMEM: s_mov_b32 m0, s0			; TOSMEM: s_mov_b32 m0, s0
	; TOSMEM: ; use m0			; TOSMEM: ; use m0

	; TOSMEM: s_dcache_wb			; TOSMEM: s_dcache_wb
	; TOSMEM: s_endpgm			; TOSMEM: s_endpgm
	define amdgpu_kernel void @restore_m0_lds(i32 %arg) {			define amdgpu_kernel void @restore_m0_lds(i32 %arg) {
	[21 lines not shown]

Revision Contents

Diff 96037

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

lib/Target/AMDGPU/SIRegisterInfo.cpp

test/CodeGen/AMDGPU/merge-m0.mir

test/CodeGen/AMDGPU/spill-m0.ll
