This is an archive of the discontinued LLVM Phabricator instance.

lib/CodeGen/MachineLICM.cpp
805	Could you add a comment on the choice of the magic constant? It’s the same one used in the loop vectorizer IIRC, but having some hints about why we chose it is a good thing. Would it make sense to make the denominator parameterizable so that it is easy to try different thresholds?
849	Shouldn’t we have a threshold in the other direction as well, i.e., Freq(B) < Freq(Preheader) * <some threshold>?
test/CodeGen/X86/sink-cheap-instructions.ll
25	What is the purpose of this change?

djasper updated this revision to Diff 22305. Mar 19 2015, 2:43 PM

djasper added inline comments.

lib/CodeGen/MachineLICM.cpp
805	Done.
849	Could you elaborate what that might accomplish? Also the frequency of the preheader doesn't really relate strongly to the frequencies within the loop AFAICT.
test/CodeGen/X86/sink-cheap-instructions.ll
25	The purpose is to ensure that the getelementptr for %6 is not pulled into the loop as it is used in every iteration. I realize that I have forgotten to actually test that. Fixed.
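The frequency threshold discussed at line 805 can be pictured with a small sketch. This is only an illustration, not the patch itself: plain integers stand in for LLVM's BlockFrequency values, integer division approximates the scaled multiplication by BranchProbability(1, N), and the helper name isColdEnoughToSink is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a block is cold enough to sink into if its frequency
// does not exceed 1/Denominator of the loop header's frequency.
// Denominator plays the role of -sink-threshold-denominator (default 5).
bool isColdEnoughToSink(std::uint64_t BlockFreq, std::uint64_t HeaderFreq,
                        std::uint32_t Denominator) {
  assert(Denominator != 0 && "threshold denominator must be non-zero");
  // Assumption: integer division is close enough for illustration.
  const std::uint64_t ThresholdFreq = HeaderFreq / Denominator;
  return BlockFreq <= ThresholdFreq;
}
```

With a header frequency of 1000 and the default denominator of 5, a block with frequency 200 still qualifies under this sketch, while one with frequency 201 does not.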

qcolombet added inline comments. Mar 19 2015, 3:30 PM

lib/CodeGen/MachineLICM.cpp
849	Sure, but the instructions we are moving come from the preheader, and we do not want to explode the number of times they are executed. Let's say the preheader is executed once and the loop 1,000,000 times. Then, even in the cold path of the loop, it is still rather expensive to sink the instructions.
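To put numbers on this concern, here is a minimal arithmetic sketch. The helper expectedExecutions is hypothetical, and the 1-in-5 probability is simply the patch's default threshold used as an upper bound for a block that still counts as cold.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: expected number of times a block runs if it is
// reached with probability Num/Den on each of Iterations loop iterations.
std::uint64_t expectedExecutions(std::uint64_t Iterations, std::uint32_t Num,
                                 std::uint32_t Den) {
  assert(Den != 0 && Num <= Den && "probability must be in [0, 1]");
  return Iterations * Num / Den;
}
```

For a loop of 1,000,000 iterations, a block reached with probability 1/5 runs about 200,000 times under this estimate, compared with a single execution in the preheader.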

djasper added inline comments. Mar 20 2015, 2:15 AM

lib/CodeGen/MachineLICM.cpp
849	I think we will almost never have accurate information about that. And the cost model is also not easy. We are generally weighing up the cost of the additional computation in each loop vs. the cost of the additional live ranges of these registers and the register spills that might be a result. The latter might actually make the computation inside the loop more expensive as demonstrated in the changed test code. Here, we are very frequently accessing one of the struct's fields. If we hoist all of the GEPs out of the loop, we end up spilling them all onto the stack and each load becomes a load from memory. With this change, we keep the GEPs of the less frequently accessed fields inside the loops (where they are folded into the LEAs and potentially don't cause significant overhead over loading them from the stack onto which they would be spilled). We still pull out the GEP for the frequently accessed field and can actually keep that in a register using only a cheap MOV inside the loop. I think the long-term solution might be to order all instructions we could sink by frequency and then sink them into the loop starting from the least frequently executed until there is no more register pressure. Not entirely sure how to implement that correctly yet. My hope is that this patch is an incremental step towards that.
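The long-term idea described above (order the sinkable instructions by frequency and sink the coldest first until register pressure is relieved) might look roughly like the following. This is a sketch under strong assumptions, not a proposal for the pass: SinkCandidate and pickSinks are hypothetical names, and it assumes every sunk instruction frees exactly one register, which real live ranges do not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction that could be sunk.
struct SinkCandidate {
  std::string Name;         // label used only for illustration
  std::uint64_t TargetFreq; // frequency of the block it would be sunk into
};

// Pick candidates to sink, coldest target block first, until the excess
// register pressure is covered. Assumption: each sink frees one register.
std::vector<std::string> pickSinks(std::vector<SinkCandidate> Cands,
                                   unsigned ExcessPressure) {
  std::stable_sort(Cands.begin(), Cands.end(),
                   [](const SinkCandidate &A, const SinkCandidate &B) {
                     return A.TargetFreq < B.TargetFreq;
                   });
  std::vector<std::string> Chosen;
  for (const SinkCandidate &C : Cands) {
    if (Chosen.size() >= ExcessPressure)
      break; // pressure relieved; leave the hotter candidates hoisted
    Chosen.push_back(C.Name);
  }
  return Chosen;
}
```

In the test case discussed here, this would keep the GEP for the frequently accessed field in the preheader and sink only the ones feeding rarely executed blocks.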

Turns out the IsCopy check was actually important and papered over the fact that we cannot sink something that is used by a PHI (because we sink to before the first non-PHI instruction later). Sinking something that is used by a PHI probably also doesn't make much sense (or at least needs a separate investigation) as e.g. we might sink along the loop-entry edge.

Also simplified the implementation.

qcolombet added inline comments. Mar 20 2015, 10:05 AM

lib/CodeGen/MachineLICM.cpp
835	Aborting on PHI does not make much sense. Instead, you should look for the common dominator for the use of the PHI, e.g., by looking at the dominator of the terminator of the related block. In that case, you wouldn’t insert after the first non-PHI instruction.
849	> We are generally weighing up the cost of the additional computation in each loop vs. the cost of the additional live ranges of these registers and the register spills that might be a result.

I agree, but checking frequencies to derive a heuristic for that sounds wrong to me. To simplify, let us assume that the spiller would insert the spill instructions at the exact same locations as the sinking algorithm. The first question we should ask ourselves is: are the reload instructions more expensive than the things we sunk? If the answer is no, then there is no point in sinking. Frequency does not help with that. Second, when do we consider the live range we shorten/extend? E.g., what if you sink a = add b<kill> + c<kill>? You end up increasing the register pressure by one in the whole loop body and pushing the instruction into a more expensive location. Frequency does not help with that either. The bottom line is, I believe that frequency has nothing to do with the heuristic we want in order to achieve better spill placement. Therefore, I do not think this is a step toward fixing the problem. I am fine if you want to experiment to gather ideas, but seeing gains out of that study seems more luck-based than an actual improvement.

djasper added inline comments. Mar 20 2015, 3:25 PM

lib/CodeGen/MachineLICM.cpp
835	I don't think I understand this. We are talking about an instruction that is currently in the loop preheader and its use is in a PHI. Doesn't that (at least in the vast majority of cases) mean that we are needing the value only when branching from the loop preheader into the loop? Thus, the instruction is currently at the best possible place.
849	> The first question we should ask ourselves is: are the reload instructions more expensive than the things we sunk? If the answer is no, then there is no point in sinking. Frequency does not help with that.

I am not sure it is that easy. Register spills have other costs as well. At the very least, they increase the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as is the case when they are unlikely to be executed, this additional cost is significant.

The model I have in mind in the long run is the following. Let's call the difference between instruction cost and cost of spill-reload C. It is the additional cost that a sunken instruction will incur each time it is executed in the loop. You are saying that we should sink all the instructions where C is negative (spill reload is cheaper). This is correct to a certain extent, although we have to factor in register pressure because we might not need to spill at all. But that is a bit of a separate consideration. Let's further assume that the likelihood of execution is L. Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself, which I mentioned above. It is the latter part that I am getting at with this patch, and I do think that looking at probabilities is a step in the right direction (although maybe not the most important first step).

Now, to explain where I am coming from: I have code with a loop that basically does deserialization, and inside the loop there is a large switch statement. In practice, each of the spilled registers is usually accessed either 0 or 1 times. Thus, sinking is very important. Also, there is already code in MachineLICM that kind of does the same thing, in line 743. It basically checks whether the loop header has more than 25 successors, in which case it cops out of hoisting instructions.

Now, of course 25 is significantly greater than the 20% probability I am using. But also, just looking at the successor count is not very helpful, as I think this condition would not fire if the loop contained a switch statement inside a surrounding branch in the loop body. We should probably also make this decision based on execution probabilities.
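The C and L model from this comment can be written down as a small decision function. This is a sketch of one reading of it, not code from the patch: C is taken as instruction cost minus reload cost (so a negative C means recomputing the value is cheaper than reloading it, which is the reading the follow-up replies settle on), all costs are abstract units, and shouldSink and its parameters are hypothetical names. Register pressure is deliberately left out, as in the comment.

```cpp
#include <cassert>

// Hypothetical decision function for the C*L model.
//   InstrCost     - cost of executing the sunk instruction once
//   ReloadCost    - cost of one reload from the stack slot
//   ExecProb      - L, probability of execution per loop iteration
//   SpillOverhead - threshold standing in for the one-time cost of the
//                   spill itself (store plus larger stack frame)
bool shouldSink(double InstrCost, double ReloadCost, double ExecProb,
                double SpillOverhead) {
  assert(ExecProb >= 0.0 && ExecProb <= 1.0 && "L must be a probability");
  const double C = InstrCost - ReloadCost; // extra cost per execution
  if (C <= 0.0)
    return true; // recomputing is no more expensive than reloading
  return C * ExecProb < SpillOverhead;
}
```

For example, with an instruction cost of 4, a reload cost of 3 and a spill overhead of 0.5, this sketch sinks at L = 0.1 but not at L = 0.9.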

Hi Daniel,

> Register spills have other costs as well. At the very least, they are increasing the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as in the case when they are unlikely to be executed, this additional cost is significant.

Ok I see where you are going now. That being said, I think that the overhead of stores + stack update (i.e., what we are talking about) is not that relevant. Anyway, I agree that it should be taken into account at some point.

> You are saying that we should sink all the instructions where C is negative (spill reload is cheaper).

The opposite :). We should *not* sink those instructions. Of course, we agree that it is useless to sink if register pressure is not a problem, which we do not check at all here.

> Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself which I mentioned above. It is the latter part that I am getting at with this patch and I do think that looking at probabilities is a step into the right direction (although maybe not the most important first step).

Aside from register pressure, which I now understand you want to consider later, I am still not convinced that the suggested check represents a meaningful cost model.
Anyhow, we can rework that when we look at the register pressure thing.
The bottom line is I am fine with whatever path you think is worth pursuing.

Cheers,
-Quentin

lib/CodeGen/MachineLICM.cpp
835	Shouldn’t we have filtered out those cases with this check: !HasLoopPHIUse(I)? I thought we were speaking of a PHI within a diamond in the loop, not the PHI of the header.

In D8451#144515, @qcolombet wrote:

> Hi Daniel,
>
> > Register spills have other costs as well. At the very least, they are increasing the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as in the case when they are unlikely to be executed, this additional cost is significant.
>
> Ok I see where you are going now. That being said, I think that the overhead of stores + stack update (i.e., what we are talking about) is not that relevant. Anyway, I agree that it should be taken into account at some point.

I think that largely depends on whether we think the instruction inside the loop is normally executed (more than once).

> > You are saying that we should sink all the instructions where C is negative (spill reload is cheaper).
>
> The opposite :). We should *not* sink those instructions. Of course, we agree that this is useless to sink if register pressure is not a problem, which we do not check at all here.

Yeah, sorry, I meant we should sink the instructions where spill reload is more expensive than the actual instruction (as long as there is register pressure). I think I should look at register pressure first. Will do.

> > Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself which I mentioned above. It is the latter part that I am getting at with this patch and I do think that looking at probabilities is a step into the right direction (although maybe not the most important first step).
>
> Aside from register pressure, which now I understand you want to consider later, I am still not convinced that the suggested check represent a meaningful cost model.
> Anyhow, we can rework that when we will look at the register pressure thing.
> The bottom line is I am fine with whatever path you think is worth pursuing.
>
> Cheers,
> -Quentin

lib/CodeGen/MachineLICM.cpp
835	Ah, right. Thanks for pointing that out. So, the trouble I was actually running into was that there is a PHI use outside of the loop. I'll need to look closer at the cases where that can actually happen. Will update as soon as I have more information.

Revision Contents

Path	Size
lib/CodeGen/MachineLICM.cpp	57 lines
test/CodeGen/X86/sink-cheap-instructions.ll	8 lines

Diff 22347

lib/CodeGen/MachineLICM.cpp

[19 lines not shown]
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/CodeGen/Passes.h"		#include "llvm/CodeGen/Passes.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
		#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
#include "llvm/CodeGen/MachineDominators.h"		#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFrameInfo.h"		#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineLoopInfo.h"		#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineMemOperand.h"		#include "llvm/CodeGen/MachineMemOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/PseudoSourceValue.h"		#include "llvm/CodeGen/PseudoSourceValue.h"
#include "llvm/MC/MCInstrItineraries.h"		#include "llvm/MC/MCInstrItineraries.h"
		#include "llvm/Support/BranchProbability.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetInstrInfo.h"		#include "llvm/Target/TargetInstrInfo.h"
#include "llvm/Target/TargetLowering.h"		#include "llvm/Target/TargetLowering.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetRegisterInfo.h"		#include "llvm/Target/TargetRegisterInfo.h"
#include "llvm/Target/TargetSubtargetInfo.h"		#include "llvm/Target/TargetSubtargetInfo.h"
using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "machine-licm"		#define DEBUG_TYPE "machine-licm"

static cl::opt<bool>		static cl::opt<bool>
AvoidSpeculation("avoid-speculation",		AvoidSpeculation("avoid-speculation",
cl::desc("MachineLICM should avoid speculation"),		cl::desc("MachineLICM should avoid speculation"),
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

static cl::opt<bool>		static cl::opt<bool>
HoistCheapInsts("hoist-cheap-insts",		HoistCheapInsts("hoist-cheap-insts",
cl::desc("MachineLICM should hoist even cheap instructions"),		cl::desc("MachineLICM should hoist even cheap instructions"),
cl::init(false), cl::Hidden);		cl::init(false), cl::Hidden);

static cl::opt<bool>		static cl::opt<bool> SinkInstsToAvoidSpills(
SinkInstsToAvoidSpills("sink-insts-to-avoid-spills",		"sink-insts-to-avoid-spills",
cl::desc("MachineLICM should sink instructions into "		cl::desc("MachineLICM should sink instructions into loops to avoid "
"loops to avoid register spills"),		"register spills"),
cl::init(false), cl::Hidden);		cl::init(false), cl::Hidden);

		static cl::opt<uint32_t> SinkThresholdDenominator(
		"sink-threshold-denominator",
		cl::desc("Sink instructions into the loop only if their probability of "
		"being executed in each iteration is less than 1/N"),
		cl::init(5), cl::Hidden);

STATISTIC(NumHoisted,		STATISTIC(NumHoisted,
"Number of machine instructions hoisted out of loops");		"Number of machine instructions hoisted out of loops");
STATISTIC(NumLowRP,		STATISTIC(NumLowRP,
"Number of instructions hoisted in low reg pressure situation");		"Number of instructions hoisted in low reg pressure situation");
STATISTIC(NumHighLatency,		STATISTIC(NumHighLatency,
"Number of high latency instructions hoisted");		"Number of high latency instructions hoisted");
STATISTIC(NumCSEed,		STATISTIC(NumCSEed,
"Number of hoisted machine instructions CSEed");		"Number of hoisted machine instructions CSEed");
STATISTIC(NumPostRAHoisted,		STATISTIC(NumPostRAHoisted,
"Number of machine instructions hoisted out of loops post regalloc");		"Number of machine instructions hoisted out of loops post regalloc");

namespace {		namespace {
class MachineLICM : public MachineFunctionPass {		class MachineLICM : public MachineFunctionPass {
const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetLoweringBase *TLI;		const TargetLoweringBase *TLI;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
		const MachineBlockFrequencyInfo *MBFI;
const MachineFrameInfo *MFI;		const MachineFrameInfo *MFI;
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;
const InstrItineraryData *InstrItins;		const InstrItineraryData *InstrItins;
bool PreRegAlloc;		bool PreRegAlloc;

// Various analyses that we use...		// Various analyses that we use...
AliasAnalysis *AA; // Alias analysis info.		AliasAnalysis *AA; // Alias analysis info.
MachineLoopInfo *MLI; // Current MachineLoopInfo		MachineLoopInfo *MLI; // Current MachineLoopInfo
[48 lines not shown]	public:
explicit MachineLICM(bool PreRA) :		explicit MachineLICM(bool PreRA) :
MachineFunctionPass(ID), PreRegAlloc(PreRA) {		MachineFunctionPass(ID), PreRegAlloc(PreRA) {
initializeMachineLICMPass(*PassRegistry::getPassRegistry());		initializeMachineLICMPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
		AU.addRequired<MachineBlockFrequencyInfo>();
AU.addRequired<MachineLoopInfo>();		AU.addRequired<MachineLoopInfo>();
AU.addRequired<MachineDominatorTree>();		AU.addRequired<MachineDominatorTree>();
AU.addRequired<AliasAnalysis>();		AU.addRequired<AliasAnalysis>();
AU.addPreserved<MachineLoopInfo>();		AU.addPreserved<MachineLoopInfo>();
AU.addPreserved<MachineDominatorTree>();		AU.addPreserved<MachineDominatorTree>();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}

[156 lines not shown]	private:
MachineBasicBlock *getCurPreheader();		MachineBasicBlock *getCurPreheader();
};		};
} // end anonymous namespace		} // end anonymous namespace

char MachineLICM::ID = 0;		char MachineLICM::ID = 0;
char &llvm::MachineLICMID = MachineLICM::ID;		char &llvm::MachineLICMID = MachineLICM::ID;
INITIALIZE_PASS_BEGIN(MachineLICM, "machinelicm",		INITIALIZE_PASS_BEGIN(MachineLICM, "machinelicm",
"Machine Loop Invariant Code Motion", false, false)		"Machine Loop Invariant Code Motion", false, false)
		INITIALIZE_PASS_DEPENDENCY(MachineBranchProbabilityInfo)
INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)		INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)		INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
INITIALIZE_AG_DEPENDENCY(AliasAnalysis)		INITIALIZE_AG_DEPENDENCY(AliasAnalysis)
INITIALIZE_PASS_END(MachineLICM, "machinelicm",		INITIALIZE_PASS_END(MachineLICM, "machinelicm",
"Machine Loop Invariant Code Motion", false, false)		"Machine Loop Invariant Code Motion", false, false)

/// LoopIsOuterMostWithPredecessor - Test if the given loop is the outer-most		/// LoopIsOuterMostWithPredecessor - Test if the given loop is the outer-most
/// loop that has a unique predecessor.		/// loop that has a unique predecessor.
[12 lines not shown]
bool MachineLICM::runOnMachineFunction(MachineFunction &MF) {		bool MachineLICM::runOnMachineFunction(MachineFunction &MF) {
if (skipOptnoneFunction(*MF.getFunction()))		if (skipOptnoneFunction(*MF.getFunction()))
return false;		return false;

Changed = FirstInLoop = false;		Changed = FirstInLoop = false;
TII = MF.getSubtarget().getInstrInfo();		TII = MF.getSubtarget().getInstrInfo();
TLI = MF.getSubtarget().getTargetLowering();		TLI = MF.getSubtarget().getTargetLowering();
TRI = MF.getSubtarget().getRegisterInfo();		TRI = MF.getSubtarget().getRegisterInfo();
		MBFI = &getAnalysis<MachineBlockFrequencyInfo>();
MFI = MF.getFrameInfo();		MFI = MF.getFrameInfo();
MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
InstrItins = MF.getSubtarget().getInstrItineraryData();		InstrItins = MF.getSubtarget().getInstrItineraryData();

PreRegAlloc = MRI->isSSA();		PreRegAlloc = MRI->isSSA();

if (PreRegAlloc)		if (PreRegAlloc)
DEBUG(dbgs() << "******** Pre-regalloc Machine LICM: ");		DEBUG(dbgs() << "******** Pre-regalloc Machine LICM: ");
[434 lines not shown]	void MachineLICM::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
}		}
}		}

void MachineLICM::SinkIntoLoop() {		void MachineLICM::SinkIntoLoop() {
MachineBasicBlock *Preheader = getCurPreheader();		MachineBasicBlock *Preheader = getCurPreheader();
if (!Preheader)		if (!Preheader)
return;		return;

		// We are sinking instructions into the loop only if they aren't 'likely' to
		// be executed during every iteration.
		// FIXME: Instead of having a fixed threshold, the probability could be a
		// factor of a cost model.
		const BranchProbability ColdProb(1, 5);
		const BlockFrequency ThresholdFreq =
		MBFI->getBlockFreq(CurLoop->getHeader()) *
		BranchProbability(1, SinkThresholdDenominator);

SmallVector<MachineInstr *, 8> Candidates;		SmallVector<MachineInstr *, 8> Candidates;
for (MachineBasicBlock::instr_iterator I = Preheader->instr_begin();		for (MachineBasicBlock::instr_iterator I = Preheader->instr_begin();
I != Preheader->instr_end(); ++I) {		I != Preheader->instr_end(); ++I) {
// We need to ensure that we can safely move this instruction into the loop.		// We need to ensure that we can safely move this instruction into the loop.
// As such, it must not have side-effects, e.g. such as a call has.		// As such, it must not have side-effects, e.g. such as a call has.
if (IsLoopInvariantInst(*I) && !HasLoopPHIUse(I))		if (IsLoopInvariantInst(*I) && !HasLoopPHIUse(I))
Candidates.push_back(I);		Candidates.push_back(I);
}		}

for (MachineInstr *I : Candidates) {		for (MachineInstr *I : Candidates) {
const MachineOperand &MO = I->getOperand(0);		const MachineOperand &MO = I->getOperand(0);
if (!MO.isDef() || !MO.isReg() || !MO.getReg())		if (!MO.isDef() || !MO.isReg() || !MO.getReg())
continue;		continue;
if (!MRI->hasOneDef(MO.getReg()))		if (!MRI->hasOneDef(MO.getReg()))
continue;		continue;
bool CanSink = true;
MachineBasicBlock *B = nullptr;		MachineBasicBlock *B = nullptr;
for (MachineInstr &MI : MRI->use_instructions(MO.getReg())) {		for (MachineInstr &MI : MRI->use_instructions(MO.getReg())) {
// FIXME: Come up with a proper cost model that estimates whether sinking		if (MI.isPHI()) {
// the instruction (and thus possibly executing it on every loop		B = nullptr;
// iteration) is more expensive than a register.
// For now assumes that copies are cheap and thus almost always worth it.
if (!MI.isCopy()) {
CanSink = false;
break;		break;
}		}
if (!B) {		B = B ? DT->findNearestCommonDominator(B, MI.getParent())
B = MI.getParent();		: MI.getParent();
continue;		if (!B)
}
B = DT->findNearestCommonDominator(B, MI.getParent());
if (!B) {
CanSink = false;
break;		break;
}		}
}		if (!B || B == Preheader)
if (!CanSink || !B || B == Preheader)		continue;

		if (MBFI->getBlockFreq(B) > ThresholdFreq)
continue;		continue;

B->splice(B->getFirstNonPHI(), Preheader, I);		B->splice(B->getFirstNonPHI(), Preheader, I);
}		}
}		}

static bool isOperandKill(const MachineOperand &MO, MachineRegisterInfo *MRI) {		static bool isOperandKill(const MachineOperand &MO, MachineRegisterInfo *MRI) {
return MO.isKill() || MRI->hasOneNonDBGUse(MO.getReg());		return MO.isKill() || MRI->hasOneNonDBGUse(MO.getReg());
}		}

/// getRegisterClassIDAndCost - For a given MI, register, and the operand		/// getRegisterClassIDAndCost - For a given MI, register, and the operand
/// index, return the ID and cost of its representative register class.		/// index, return the ID and cost of its representative register class.
void		void
MachineLICM::getRegisterClassIDAndCost(const MachineInstr *MI,		MachineLICM::getRegisterClassIDAndCost(const MachineInstr *MI,

test/CodeGen/X86/sink-cheap-instructions.ll

 ; RUN: llc < %s -mtriple=x86_64-linux | FileCheck %s -check-prefix=CHECK
 ; RUN: llc < %s -mtriple=x86_64-linux -sink-insts-to-avoid-spills | FileCheck %s -check-prefix=SINK

 ; Ensure that we sink copy-like instructions into loops to avoid register
 ; spills.

 ; CHECK: Spill
 ; SINK-NOT: Spill
+; SINK: lea{{.*}}20
+; SINK-NEXT: jmp

 %struct.A = type { i32, i32, i32, i32, i32, i32 }

 define void @_Z1fPhP1A(i8* nocapture readonly %input, %struct.A* %a) {
 %1 = getelementptr inbounds %struct.A, %struct.A* %a, i64 0, i32 0
 %2 = getelementptr inbounds %struct.A, %struct.A* %a, i64 0, i32 1
 %3 = getelementptr inbounds %struct.A, %struct.A* %a, i64 0, i32 2
 %4 = getelementptr inbounds %struct.A, %struct.A* %a, i64 0, i32 3
 %5 = getelementptr inbounds %struct.A, %struct.A* %a, i64 0, i32 4
 %6 = getelementptr inbounds %struct.A, %struct.A* %a, i64 0, i32 5
 br label %.backedge

 .backedge:
 %.0 = phi i8* [ %input, %0 ], [ %7, %.backedge.backedge ]
+tail call void @_Z6assignPj(i32* %6)
qcolombet: What is the purpose of this change?
djasper (author): The purpose is to ensure that the getelementptr for %6 is not pulled into the loop as it is used in every iteration. I realize that I have forgotten to actually test that. Fixed.
 %7 = getelementptr inbounds i8, i8* %.0, i64 1
 %8 = load i8, i8* %7, align 1
 switch i8 %8, label %.backedge.backedge [
 i8 0, label %9
 i8 10, label %10
 i8 20, label %11
 i8 30, label %12
 i8 40, label %13
+i8 50, label %14
 ]

 ; <label>:9
 tail call void @_Z6assignPj(i32* %1)
 br label %.backedge.backedge

 ; <label>:10
 tail call void @_Z6assignPj(i32* %2)
 br label %.backedge.backedge

 .backedge.backedge:
 br label %.backedge

 ; <label>:11
 tail call void @_Z6assignPj(i32* %3)
 br label %.backedge.backedge

 ; <label>:12
 tail call void @_Z6assignPj(i32* %4)
 br label %.backedge.backedge

 ; <label>:13
 tail call void @_Z6assignPj(i32* %5)
 br label %.backedge.backedge

+; <label>:14
+tail call void @_Z6assignPj(i32* %6)
+br label %.backedge.backedge
 }

 declare void @_Z6assignPj(i32*)
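
As a rough illustration of why the test separates the two kinds of uses, the arithmetic below applies the 20% figure from the review discussion to this loop. It assumes, purely for illustration, that all seven switch successors (six cases plus the default) are equally likely; the frequencies LLVM actually computes for this test may differ. The helper names are hypothetical and do not come from the patch.

```cpp
#include <cassert>

// Relative execution frequency of one switch successor per loop iteration,
// assuming every successor (including the default) is equally likely.
double uniformSuccessorFrequency(unsigned NumSuccessors) {
  assert(NumSuccessors > 0);
  return 1.0 / NumSuccessors;
}

// True if a block at the given relative frequency counts as "unlikely"
// under the threshold (20% is the figure mentioned in the discussion).
bool isUnlikely(double RelativeFrequency, double Threshold = 0.2) {
  return RelativeFrequency < Threshold;
}
```

Under that assumption each case block runs in about 1/7 (roughly 14%) of the iterations, which is below 20%, so the getelementptrs used only there are sinking candidates. The call on %6 in the loop header runs in every iteration (relative frequency 1.0), so its getelementptr is expected to stay outside the loop.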

[MachineLICM] Sink instructions only if they are unlikely to be executed
Needs Review, Public

Revision Contents

Diff 22347

lib/CodeGen/MachineLICM.cpp

test/CodeGen/X86/sink-cheap-instructions.ll
