Details

Reviewers

myatsina
MatzeB
atrick
mkuper

Commits

rG578cf7aae755: [ExecutionDepsFix] Improve clearance calculation for loops
rL293571: [ExecutionDepsFix] Improve clearance calculation for loops

Summary

In revision rL278321, ExecutionDepsFix learned how to pick a better
register for undef register reads, e.g. for instructions such as
vcvtsi2sdq. While this revision improved performance on a good number
of our benchmarks, it unfortunately also caused significant regressions
(up to 3x) on others. This regression turned out to be caused by loops
such as:

PH -> A -> B (xmm<Undef> -> xmm<Def>) -> C -> D -> EXIT
      ^                                  |
      +----------------------------------+

In the previous version of the clearance calculation, we would visit
the blocks in order, remembering for each whether there were any
incoming backedges from blocks that we hadn't processed yet and if
so queuing up the block to be re-processed. However, for loop structures
such as the above, this is clearly insufficient, since the block B
does not have any unknown backedges, so we do not see the false
dependency from the previous iteration's Def of xmm registers in B.
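The shortcoming described above can be reproduced with a small Python model. This is only a sketch of the visiting scheme, not the LLVM C++ code: the block names come from the diagram, while `PREDS`, `RPO` and `old_visit_order` are hypothetical names introduced for illustration, and EXIT is left out for brevity.

```python
# Toy model of the pre-patch visiting scheme (illustrative names, not LLVM's).
# CFG from the summary: PH -> A -> B -> C -> D, plus the backedge C -> A.
PREDS = {
    "PH": [],
    "A": ["PH", "C"],
    "B": ["A"],
    "C": ["B"],
    "D": ["C"],
}
RPO = ["PH", "A", "B", "C", "D"]


def old_visit_order(rpo, preds):
    """Visit blocks once in order; re-queue only blocks that have a
    predecessor which has not been visited yet (an unknown backedge)."""
    visited = set()
    order = []
    requeued = []
    for block in rpo:
        if any(p not in visited for p in preds[block]):
            requeued.append(block)
        order.append(block)
        visited.add(block)
    # Second pass over the re-queued blocks, marked with a prime.
    order.extend(block + "'" for block in requeued)
    return order, requeued


ORDER, REQUEUED = old_visit_order(RPO, PREDS)
# ORDER is PH A B C D A'; only A is re-queued.
```

Only A has an unvisited predecessor, so B is never revisited, even though the xmm def at the end of B reaches the undef read in B again through C -> A -> B.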

To fix this, we need to consider all blocks that are part of the loop
and reprocess them once the correct clearance values are known. As
an optimization, we also want to avoid reprocessing any later blocks
that are not part of the loop.

In summary, the iteration order is as follows:
Before: PH A B C D A'
Corrected (Naive): PH A B C D A' B' C' D'
Corrected (w/ optimization): PH A B C A' B' C' D

To facilitate this optimization we introduce two new counters for each
basic block. The first counts how many of its predecessors have
completed primary processing. The second counts how many of its
predecessors have completed all processing (we will call such a block
*done*). Now, the criteria to reprocess a block are as follows:

- All predecessors have completed primary processing.
- For x the number of predecessors that have completed primary processing *at the time of primary processing of this block*, the number of predecessors that are done has reached x.

The intuition behind these criteria is as follows:
We need to perform primary processing on all predecessors in order to
find any direct defs in those predecessors. When predecessors are
done, we also know that we have information about indirect defs (e.g.
defs in block B that were inherited through B->C->A->B). However,
we can't wait for all predecessors to be done, since that would
cause cyclic dependencies. It is guaranteed, though, that all those
predecessors that are prior to us in reverse postorder will be done
before us. Since we iterate over the basic blocks in reverse postorder,
the number x above is precisely the number of predecessors
prior to us in reverse postorder.
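To make the counters concrete, here is a minimal Python model of the scheme on the example loop. It follows the structure of `isBlockDone` and `updateSuccessors` from the diff, but it is a sketch: the dictionary keys and the `visit_order` helper are names chosen for illustration, actual block processing is reduced to recording the visit order, and EXIT is omitted.

```python
# CFG from the summary, expressed as successor lists (EXIT omitted).
SUCCS = {"PH": ["A"], "A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
RPO = ["PH", "A", "B", "C", "D"]


def visit_order(rpo, succs):
    preds = {b: [] for b in rpo}
    for b in rpo:
        for s in succs[b]:
            preds[s].append(b)
    # Per-block counters, mirroring the fields of MBBInfo in the diff.
    info = {b: {"primary_completed": False, "incoming_processed": 0,
                "primary_incoming": 0, "incoming_completed": 0}
            for b in rpo}
    order = []

    def is_done(b):
        i = info[b]
        return (i["primary_completed"]
                and i["incoming_completed"] == i["primary_incoming"]
                and i["incoming_processed"] == len(preds[b]))

    def update_successors(b, primary):
        done = is_done(b)
        for s in succs[b]:
            if is_done(s):
                continue
            if primary:
                info[s]["incoming_processed"] += 1
            if done:
                info[s]["incoming_completed"] += 1
            if is_done(s):
                # Secondary processing of a successor that just became done.
                order.append(s + "'")
                update_successors(s, False)

    for b in rpo:
        # Primary processing in reverse postorder.
        info[b]["primary_completed"] = True
        info[b]["primary_incoming"] = info[b]["incoming_processed"]
        order.append(b)
        update_successors(b, True)
    return order, all(is_done(b) for b in rpo)


ORDER, ALL_DONE = visit_order(RPO, SUCCS)
# ORDER is PH A B C A' B' C' D, and every block ends up done.
```

In this model D is already done when its primary processing starts, because C' has completed by then, so D is visited only once.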

Diff Detail

Repository: rL LLVM

Event Timeline

loladiro updated this revision to Diff 84518. Jan 15 2017, 9:32 PM

loladiro retitled this revision to [ExecutionDepsFix] Improve clearance calculation for loops.

loladiro updated this object.

loladiro added reviewers: MatzeB, myatsina, mkuper, atrick.

loladiro added a subscriber: llvm-commits.

loladiro updated this object. Jan 15 2017, 9:33 PM

vchuravy added a subscriber: vchuravy. Jan 15 2017, 9:48 PM

sanjoy added a subscriber: sanjoy. Jan 15 2017, 9:49 PM

myatsina added inline comments. Jan 16 2017, 8:01 AM

lib/CodeGen/ExecutionDepsFix.cpp
146 ↗	(On Diff #84518)	Not --> Note ?
413 ↗	(On Diff #84518)	Can you add an assertion message please?
415 ↗	(On Diff #84518)	Is OutRegs null when it's a back edge from a BB we haven't seen yet? Are there other cases where it can be null? Please add a comment explaining the cases.
456 ↗	(On Diff #84518)	Some of the comments in this function need to be updated (we no longer have LiveOuts, we always change the defs to be relative to the end of the block, etc.)
571 ↗	(On Diff #84518)	For readability, how about changing "BlockDone" to "breakDependency" and adding your comment regarding done blocks before the call to processDefs? processDefs() will look like this:
    if (breakDependency) { // calc Pref ... }
and processBasicBlock will look like this:
    // If this block is not done, it makes little sense ...
    bool breakDependency = isBlockDone(Done)
    processDefs(MI, breakDependency, ...)
792 ↗	(On Diff #84518)	Do you need "Done" here? I don't see you using it.
802 ↗	(On Diff #84518)	Is there a point going over it when we're not isBlockDone? processDefs pushes instructions into the undef read only when Done =true. processUndefReads does break if the undef reads are empty, but perhaps for readability purpose it's worth writing this explicitly.
819 ↗	(On Diff #84518)	Why not check isBasicBlockDone on MBB?
822 ↗	(On Diff #84518)	At first glance, it wasn't clear to me why you need it here and why you can't just do one "primary" pass and then process the basic blocks that are still not done. If I understood correctly, you're doing it here for the optimization you've talked about (Making sure the order is: PH A B C A' B' C' D). Am I right? I would add a comment here elaborating that. I would even consider adding a comment somewhere with the loop example from the description of this patch and the "optimized" order in which the algorithm visits the nodes. I think it is a great example and it will make the traversal order here much clearer.
868 ↗	(On Diff #84518)	Better use a constructor with default initialization like DomainValue does.
878 ↗	(On Diff #84518)	I would add a comment that IncomingProcessed and IncomingCompleted of this block were already updated during the processing of predecessor blocks.

loladiro mentioned this in D28786: [ExecutionDepsFix] Kill clearance at function entry/calls. Jan 16 2017, 3:49 PM

loladiro added a child revision: D28786: [ExecutionDepsFix] Kill clearance at function entry/calls. Jan 16 2017, 3:49 PM

Address review comments

Fix small typo in assertion message

lib/CodeGen/ExecutionDepsFix.cpp
415 ↗	(On Diff #84518)	Yes, that's correct. If it's null it's a back edge. Will add a comment to this extent.
819 ↗	(On Diff #84518)	Just to avoid doing the calculation twice. Probably premature optimization. Will simplify.
822 ↗	(On Diff #84518)	"If I understood correctly, you're doing it here for the optimization you've talked about (Making sure the order is: PH A B C A' B' C' D). Am I right?" Yes, that's correct. It's discussed below when going over the blocks that are not done. I will add the loop example from the commit message below and add a small comment here pointing people to that discussion.

Forgot to run clang-format

loladiro added a child revision: D28915: [ExecutionDepsFix] Optimize instruction insertion. Jan 19 2017, 1:23 PM

Added a few minor comments.

LGTM once they are addressed.

lib/CodeGen/ExecutionDepsFix.cpp
415 ↗	(On Diff #84518)	Don't forget to add the comment :)
test/CodeGen/X86/break-false-dep.ll
313 ↗	(On Diff #85013)	xmm7 --> xmm6 ?

This revision is now accepted and ready to land. Jan 24 2017, 6:25 AM

Closed by commit rL293571: [ExecutionDepsFix] Improve clearance calculation for loops (authored by kfischer). Jan 30 2017, 3:48 PM

This revision was automatically updated to reflect the committed changes.

mehdi_amini added a subscriber: bruno. Mar 7 2017, 10:45 PM

mehdi_amini added a subscriber: mehdi_amini. Mar 7 2017, 10:47 PM

mehdi_amini added inline comments.

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837	What is the limit on the depth of the stack? We're seeing a crash because of stack explosion here, so I fear it can grow with the CFG (which wouldn't seem reasonable to me). Can you comment on this?

mehdi_amini added inline comments. Mar 7 2017, 10:49 PM

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837	Note: I haven't spent time figuring out what `ExeDepsFix` is doing, don't assume I have any context. The crash we're tracking is a ThinLTO bootstrap failure, we're still working on the exact reproducer.

loladiro added inline comments. Mar 7 2017, 10:56 PM

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837	Yes, I suppose it can grow with the number of nested loops. Must be quite a reproducer to cause this problem though. Unless of course there's something more fundamental wrong with the logic here (though for that I'd need the reproducer). In any case, it would be fine to change this to keep a working set in a SmallVector or something equivalent.

Yes, I suppose it can grow with the number of nested loops. Must be quite a reproducer to cause this problem though. Unless of course there's something more fundamental wrong with the logic here (though for that I'd need the reproducer). In any case, it would be fine to change this to keep a working set in a SmallVector or something equivalent.

This is a recursion over the control flow graph and you only need a single loop. Recursing over the structure of the program is a no-go in a compiler as your stack is always limited and comparatively small and the input can grow arbitrarily.
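As an illustration of the "single loop" point, the following Python sketch drives the same counter scheme with an explicit worklist instead of recursion. It is a toy model under stated assumptions, not code from any LLVM patch: `visit_order_iterative` and the synthetic block names are hypothetical, and block processing is reduced to recording the visit order.

```python
def visit_order_iterative(rpo, succs):
    num_preds = {b: 0 for b in rpo}
    for b in rpo:
        for s in succs[b]:
            num_preds[s] += 1
    primary_completed = {b: False for b in rpo}
    incoming_processed = {b: 0 for b in rpo}
    primary_incoming = {b: 0 for b in rpo}
    incoming_completed = {b: 0 for b in rpo}
    order = []

    def is_done(b):
        return (primary_completed[b]
                and incoming_completed[b] == primary_incoming[b]
                and incoming_processed[b] == num_preds[b])

    def update_successors(start, primary):
        # Explicit worklist: memory use grows on the heap, not the call stack.
        worklist = [(start, primary)]
        while worklist:
            block, is_primary = worklist.pop()
            done = is_done(block)
            for s in succs[block]:
                if is_done(s):
                    continue
                if is_primary:
                    incoming_processed[s] += 1
                if done:
                    incoming_completed[s] += 1
                if is_done(s):
                    order.append(s + "'")        # secondary processing
                    worklist.append((s, False))  # update its successors later

    for b in rpo:
        primary_completed[b] = True
        primary_incoming[b] = incoming_processed[b]
        order.append(b)
        update_successors(b, True)
    return order


# The small loop from the summary still yields PH A B C A' B' C' D.
SMALL = {"PH": ["A"], "A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
SMALL_ORDER = visit_order_iterative(["PH", "A", "B", "C", "D"], SMALL)

# One loop of N blocks: b0 -> b1 -> ... -> bN, with a backedge bN -> b1.
N = 5000
NAMES = ["b%d" % i for i in range(N + 1)]
BIG = {NAMES[i]: [NAMES[i + 1]] for i in range(N)}
BIG[NAMES[N]] = [NAMES[1]]
BIG_ORDER = visit_order_iterative(NAMES, BIG)
```

In the large example every block of the loop becomes done as a consequence of its predecessor becoming done, so a recursive formulation would nest one call per block; the worklist keeps the call depth constant. On CFGs with more branching, the relative order of secondary visits may differ from the recursive version.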

We are seeing this in a real stage2 build of clang and we have to fix the buildbot! If you need a reproducer: update llvm to r298184, limit your stack to 512kb (ulimit -s 512), as that is what ThinLTO currently uses, and use this python snippet to generate one:

# Number of basic blocks in the generated MIR function.
n_blocks = 4000

# Entry block: branches either to bb.1 or straight to the exit block.
print('''
---
name: func
tracksRegLiveness: true
body: |
  bb.0:
    successors: %bb.1, %bb.{n_blocks}
    liveins: %xmm0
    NOOP implicit %xmm0
    JE_1 %bb.{n_blocks}, implicit undef %eflags
    JMP_1 %bb.1
'''.format(**locals()))

# Chain of blocks; each one can also jump to the exit block.
for i in range(1, n_blocks):
    print('  bb.%s:' % i)
    if i < n_blocks - 1:
        print('    successors: %%bb.%s, %%bb.%s' % (i + 1, n_blocks))
        print('    JE_1 %%bb.%s, implicit undef %%eflags' % (i + 1))
    else:
        print('    successors: %%bb.%s' % n_blocks)
    print('    JMP_1 %%bb.%s' % n_blocks)

# Exit block.
print('''
  bb.{n_blocks}:
    RETQ undef %eax
'''.format(**locals()))

In D28759#704582, @MatzeB wrote:

We are seeing this in a real stage2 build of clang and we have to fix the buildbot!

FYI this is the failing job right now : http://green.lab.llvm.org/green/view/Clang/job/clang-stage2-Rthinlto/

Can you try https://reviews.llvm.org/D31681? I wasn't able to reproduce the problem with your example, it kept crashing other parts of the compiler ;).

In D28759#718506, @loladiro wrote:

Can you try https://reviews.llvm.org/D31681? I wasn't able to reproduce the problem with your example, it kept crashing other parts of the compiler ;).

I think after we saw the stack overflow in the pass on the build, I did all my further testing with "llc -run-pass=x86-execution-deps-fix" only, so I didn't see the other passes. Anyway, the build seems to be back to green now. Thanks!

Diff 86362

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp

(136 lines not shown)	class ExeDepsFix : public MachineFunctionPass {
const TargetRegisterClass *const RC;		const TargetRegisterClass *const RC;
MachineFunction *MF;		MachineFunction *MF;
const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
RegisterClassInfo RegClassInfo;		RegisterClassInfo RegClassInfo;
std::vector<SmallVector<int, 1>> AliasMap;		std::vector<SmallVector<int, 1>> AliasMap;
const unsigned NumRegs;		const unsigned NumRegs;
LiveReg *LiveRegs;		LiveReg *LiveRegs;
typedef DenseMap<MachineBasicBlock*, LiveReg*> LiveOutMap;		struct MBBInfo {
LiveOutMap LiveOuts;		// Keeps clearance and domain information for all registers. Note that this
		// is different from the usual definition notion of liveness. The CPU
		// doesn't care whether or not we consider a register killed.
		LiveReg *OutRegs;

		// Whether we have gotten to this block in primary processing yet.
		bool PrimaryCompleted;

		// The number of predecessors for which primary processing has completed
		unsigned IncomingProcessed;

		// The value of `IncomingProcessed` at the start of primary processing
		unsigned PrimaryIncoming;

		// The number of predecessors for which all processing steps are done.
		unsigned IncomingCompleted;

		MBBInfo()
		: OutRegs(nullptr), PrimaryCompleted(false), IncomingProcessed(0),
		PrimaryIncoming(0), IncomingCompleted(0) {}
		};
		typedef DenseMap<MachineBasicBlock *, MBBInfo> MBBInfoMap;
		MBBInfoMap MBBInfos;

/// List of undefined register reads in this block in forward order.		/// List of undefined register reads in this block in forward order.
std::vector<std::pair<MachineInstr*, unsigned> > UndefReads;		std::vector<std::pair<MachineInstr*, unsigned> > UndefReads;

/// Storage for register unit liveness.		/// Storage for register unit liveness.
LivePhysRegs LiveRegSet;		LivePhysRegs LiveRegSet;

/// Current instruction number.		/// Current instruction number.
/// The first instruction in each basic block is 0.		/// The first instruction in each basic block is 0.
int CurInstr;		int CurInstr;

/// True when the current block has a predecessor that hasn't been visited
/// yet.
bool SeenUnknownBackEdge;

public:		public:
ExeDepsFix(const TargetRegisterClass *rc)		ExeDepsFix(const TargetRegisterClass *rc)
: MachineFunctionPass(ID), RC(rc), NumRegs(RC->getNumRegs()) {}		: MachineFunctionPass(ID), RC(rc), NumRegs(RC->getNumRegs()) {}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesAll();		AU.setPreservesAll();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

MachineFunctionProperties getRequiredProperties() const override {		MachineFunctionProperties getRequiredProperties() const override {
return MachineFunctionProperties().set(		return MachineFunctionProperties().set(
MachineFunctionProperties::Property::NoVRegs);		MachineFunctionProperties::Property::NoVRegs);
}		}

StringRef getPassName() const override { return "Execution dependency fix"; }		StringRef getPassName() const override { return "Execution dependency fix"; }

private:		private:
iterator_range<SmallVectorImpl<int>::const_iterator>		iterator_range<SmallVectorImpl<int>::const_iterator>
regIndices(unsigned Reg) const;		regIndices(unsigned Reg) const;

// DomainValue allocation.		// DomainValue allocation.
DomainValue *alloc(int domain = -1);		DomainValue *alloc(int domain = -1);
DomainValue *retain(DomainValue *DV) {		DomainValue *retain(DomainValue *DV) {
if (DV) ++DV->Refs;		if (DV) ++DV->Refs;
return DV;		return DV;
}		}
void release(DomainValue*);		void release(DomainValue*);
DomainValue *resolve(DomainValue*&);		DomainValue *resolve(DomainValue*&);

// LiveRegs manipulations.		// LiveRegs manipulations.
void setLiveReg(int rx, DomainValue *DV);		void setLiveReg(int rx, DomainValue *DV);
void kill(int rx);		void kill(int rx);
void force(int rx, unsigned domain);		void force(int rx, unsigned domain);
void collapse(DomainValue *dv, unsigned domain);		void collapse(DomainValue *dv, unsigned domain);
bool merge(DomainValue *A, DomainValue *B);		bool merge(DomainValue *A, DomainValue *B);

void enterBasicBlock(MachineBasicBlock*);		void enterBasicBlock(MachineBasicBlock*);
void leaveBasicBlock(MachineBasicBlock*);		void leaveBasicBlock(MachineBasicBlock*);
void visitInstr(MachineInstr*);		bool isBlockDone(MachineBasicBlock *);
void processDefs(MachineInstr*, bool Kill);		void processBasicBlock(MachineBasicBlock *MBB, bool PrimaryPass);
		void updateSuccessors(MachineBasicBlock *MBB, bool PrimaryPass);
		bool visitInstr(MachineInstr *);
		void processDefs(MachineInstr *, bool breakDependency, bool Kill);
void visitSoftInstr(MachineInstr*, unsigned mask);		void visitSoftInstr(MachineInstr*, unsigned mask);
void visitHardInstr(MachineInstr*, unsigned domain);		void visitHardInstr(MachineInstr*, unsigned domain);
void pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,		void pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,
unsigned Pref);		unsigned Pref);
bool shouldBreakDependence(MachineInstr*, unsigned OpIdx, unsigned Pref);		bool shouldBreakDependence(MachineInstr*, unsigned OpIdx, unsigned Pref);
void processUndefReads(MachineBasicBlock*);		void processUndefReads(MachineBasicBlock*);
};		};
}		}
(143 lines not shown)	for (unsigned rx = 0; rx != NumRegs; ++rx) {
if (LiveRegs[rx].Value == B)		if (LiveRegs[rx].Value == B)
setLiveReg(rx, A);		setLiveReg(rx, A);
}		}
return true;		return true;
}		}

/// Set up LiveRegs by merging predecessor live-out values.		/// Set up LiveRegs by merging predecessor live-out values.
void ExeDepsFix::enterBasicBlock(MachineBasicBlock *MBB) {		void ExeDepsFix::enterBasicBlock(MachineBasicBlock *MBB) {
// Detect back-edges from predecessors we haven't processed yet.
SeenUnknownBackEdge = false;

// Reset instruction counter in each basic block.		// Reset instruction counter in each basic block.
CurInstr = 0;		CurInstr = 0;

// Set up UndefReads to track undefined register reads.		// Set up UndefReads to track undefined register reads.
UndefReads.clear();		UndefReads.clear();
LiveRegSet.clear();		LiveRegSet.clear();

// Set up LiveRegs to represent registers entering MBB.		// Set up LiveRegs to represent registers entering MBB.
(18 lines not shown)	if (MBB->pred_empty()) {
}		}
DEBUG(dbgs() << "BB#" << MBB->getNumber() << ": entry\n");		DEBUG(dbgs() << "BB#" << MBB->getNumber() << ": entry\n");
return;		return;
}		}

// Try to coalesce live-out registers from predecessors.		// Try to coalesce live-out registers from predecessors.
for (MachineBasicBlock::const_pred_iterator pi = MBB->pred_begin(),		for (MachineBasicBlock::const_pred_iterator pi = MBB->pred_begin(),
pe = MBB->pred_end(); pi != pe; ++pi) {		pe = MBB->pred_end(); pi != pe; ++pi) {
LiveOutMap::const_iterator fi = LiveOuts.find(*pi);		auto fi = MBBInfos.find(*pi);
if (fi == LiveOuts.end()) {		assert(fi != MBBInfos.end() &&
SeenUnknownBackEdge = true;		"Should have pre-allocated MBBInfos for all MBBs");
		LiveReg *Incoming = fi->second.OutRegs;
		// Incoming is null if this is a backedge from a BB
		// we haven't processed yet
		if (Incoming == nullptr) {
continue;		continue;
}		}
assert(fi->second && "Can't have NULL entries");

for (unsigned rx = 0; rx != NumRegs; ++rx) {		for (unsigned rx = 0; rx != NumRegs; ++rx) {
// Use the most recent predecessor def for each register.		// Use the most recent predecessor def for each register.
LiveRegs[rx].Def = std::max(LiveRegs[rx].Def, fi->second[rx].Def);		LiveRegs[rx].Def = std::max(LiveRegs[rx].Def, Incoming[rx].Def);

DomainValue *pdv = resolve(fi->second[rx].Value);		DomainValue *pdv = resolve(Incoming[rx].Value);
if (!pdv)		if (!pdv)
continue;		continue;
if (!LiveRegs[rx].Value) {		if (!LiveRegs[rx].Value) {
setLiveReg(rx, pdv);		setLiveReg(rx, pdv);
continue;		continue;
}		}

// We have a live DomainValue from more than one predecessor.		// We have a live DomainValue from more than one predecessor.
if (LiveRegs[rx].Value->isCollapsed()) {		if (LiveRegs[rx].Value->isCollapsed()) {
// We are already collapsed, but predecessor is not. Force it.		// We are already collapsed, but predecessor is not. Force it.
unsigned Domain = LiveRegs[rx].Value->getFirstDomain();		unsigned Domain = LiveRegs[rx].Value->getFirstDomain();
if (!pdv->isCollapsed() && pdv->hasDomain(Domain))		if (!pdv->isCollapsed() && pdv->hasDomain(Domain))
collapse(pdv, Domain);		collapse(pdv, Domain);
continue;		continue;
}		}

// Currently open, merge in predecessor.		// Currently open, merge in predecessor.
if (!pdv->isCollapsed())		if (!pdv->isCollapsed())
merge(LiveRegs[rx].Value, pdv);		merge(LiveRegs[rx].Value, pdv);
else		else
force(rx, pdv->getFirstDomain());		force(rx, pdv->getFirstDomain());
}		}
}		}
DEBUG(dbgs() << "BB#" << MBB->getNumber()		DEBUG(
<< (SeenUnknownBackEdge ? ": incomplete\n" : ": all preds known\n"));		dbgs() << "BB#" << MBB->getNumber()
		<< (!isBlockDone(MBB) ? ": incomplete\n" : ": all preds known\n"));
}		}

void ExeDepsFix::leaveBasicBlock(MachineBasicBlock *MBB) {		void ExeDepsFix::leaveBasicBlock(MachineBasicBlock *MBB) {
assert(LiveRegs && "Must enter basic block first.");		assert(LiveRegs && "Must enter basic block first.");
// Save live registers at end of MBB - used by enterBasicBlock().		LiveReg *OldOutRegs = MBBInfos[MBB].OutRegs;
// Also use LiveOuts as a visited set to detect back-edges.		// Save register clearances at end of MBB - used by enterBasicBlock().
bool First = LiveOuts.insert(std::make_pair(MBB, LiveRegs)).second;		MBBInfos[MBB].OutRegs = LiveRegs;

if (First) {		// While processing the basic block, we kept `Def` relative to the start
// LiveRegs was inserted in LiveOuts. Adjust all defs to be relative to		// of the basic block for convenience. However, future use of this information
// the end of this block instead of the beginning.		// only cares about the clearance from the end of the block, so adjust
		// everything to be relative to the end of the basic block.
for (unsigned i = 0, e = NumRegs; i != e; ++i)		for (unsigned i = 0, e = NumRegs; i != e; ++i)
LiveRegs[i].Def -= CurInstr;		LiveRegs[i].Def -= CurInstr;
} else {		if (OldOutRegs) {
// Insertion failed, this must be the second pass.		// This must be the second pass.
// Release all the DomainValues instead of keeping them.		// Release all the DomainValues instead of keeping them.
for (unsigned i = 0, e = NumRegs; i != e; ++i)		for (unsigned i = 0, e = NumRegs; i != e; ++i)
release(LiveRegs[i].Value);		release(OldOutRegs[i].Value);
delete[] LiveRegs;		delete[] OldOutRegs;
}		}
LiveRegs = nullptr;		LiveRegs = nullptr;
}		}

void ExeDepsFix::visitInstr(MachineInstr *MI) {		bool ExeDepsFix::visitInstr(MachineInstr *MI) {
if (MI->isDebugValue())
return;

// Update instructions with explicit execution domains.		// Update instructions with explicit execution domains.
std::pair<uint16_t, uint16_t> DomP = TII->getExecutionDomain(*MI);		std::pair<uint16_t, uint16_t> DomP = TII->getExecutionDomain(*MI);
if (DomP.first) {		if (DomP.first) {
if (DomP.second)		if (DomP.second)
visitSoftInstr(MI, DomP.second);		visitSoftInstr(MI, DomP.second);
else		else
visitHardInstr(MI, DomP.first);		visitHardInstr(MI, DomP.first);
}		}

// Process defs to track register ages, and kill values clobbered by generic		return !DomP.first;
// instructions.
processDefs(MI, !DomP.first);
}		}

/// \brief Helps avoid false dependencies on undef registers by updating the		/// \brief Helps avoid false dependencies on undef registers by updating the
/// machine instructions' undef operand to use a register that the instruction		/// machine instructions' undef operand to use a register that the instruction
/// is truly dependent on, or use a register with clearance higher than Pref.		/// is truly dependent on, or use a register with clearance higher than Pref.
void ExeDepsFix::pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,		void ExeDepsFix::pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,
unsigned Pref) {		unsigned Pref) {
MachineOperand &MO = MI->getOperand(OpIdx);		MachineOperand &MO = MI->getOperand(OpIdx);
(53 lines not shown)	bool ExeDepsFix::shouldBreakDependence(MachineInstr *MI, unsigned OpIdx,
for (int rx : regIndices(reg)) {		for (int rx : regIndices(reg)) {
unsigned Clearance = CurInstr - LiveRegs[rx].Def;		unsigned Clearance = CurInstr - LiveRegs[rx].Def;
DEBUG(dbgs() << "Clearance: " << Clearance << ", want " << Pref);		DEBUG(dbgs() << "Clearance: " << Clearance << ", want " << Pref);

if (Pref > Clearance) {		if (Pref > Clearance) {
DEBUG(dbgs() << ": Break dependency.\n");		DEBUG(dbgs() << ": Break dependency.\n");
continue;		continue;
}		}
// The current clearance seems OK, but we may be ignoring a def from a
// back-edge.
if (!SeenUnknownBackEdge || Pref <= unsigned(CurInstr)) {
DEBUG(dbgs() << ": OK .\n");		DEBUG(dbgs() << ": OK .\n");
return false;		return false;
}		}
// A def from an unprocessed back-edge may make us break this dependency.
DEBUG(dbgs() << ": Wait for back-edge to resolve.\n");
return false;
}
return true;		return true;
}		}

// Update def-ages for registers defined by MI.		// Update def-ages for registers defined by MI.
// If Kill is set, also kill off DomainValues clobbered by the defs.		// If Kill is set, also kill off DomainValues clobbered by the defs.
//		//
// Also break dependencies on partial defs and undef uses.		// Also break dependencies on partial defs and undef uses.
void ExeDepsFix::processDefs(MachineInstr *MI, bool Kill) {		void ExeDepsFix::processDefs(MachineInstr *MI, bool breakDependency,
		bool Kill) {
assert(!MI->isDebugValue() && "Won't process debug values");		assert(!MI->isDebugValue() && "Won't process debug values");

// Break dependence on undef uses. Do this before updating LiveRegs below.		// Break dependence on undef uses. Do this before updating LiveRegs below.
unsigned OpNum;		unsigned OpNum;
		if (breakDependency) {
unsigned Pref = TII->getUndefRegClearance(*MI, OpNum, TRI);		unsigned Pref = TII->getUndefRegClearance(*MI, OpNum, TRI);
if (Pref) {		if (Pref) {
pickBestRegisterForUndef(MI, OpNum, Pref);		pickBestRegisterForUndef(MI, OpNum, Pref);
if (shouldBreakDependence(MI, OpNum, Pref))		if (shouldBreakDependence(MI, OpNum, Pref))
UndefReads.push_back(std::make_pair(MI, OpNum));		UndefReads.push_back(std::make_pair(MI, OpNum));
}		}
		}
const MCInstrDesc &MCID = MI->getDesc();		const MCInstrDesc &MCID = MI->getDesc();
for (unsigned i = 0,		for (unsigned i = 0,
e = MI->isVariadic() ? MI->getNumOperands() : MCID.getNumDefs();		e = MI->isVariadic() ? MI->getNumOperands() : MCID.getNumDefs();
i != e; ++i) {		i != e; ++i) {
MachineOperand &MO = MI->getOperand(i);		MachineOperand &MO = MI->getOperand(i);
if (!MO.isReg())		if (!MO.isReg())
continue;		continue;
if (MO.isUse())		if (MO.isUse())
continue;		continue;
for (int rx : regIndices(MO.getReg())) {		for (int rx : regIndices(MO.getReg())) {
// This instruction explicitly defines rx.		// This instruction explicitly defines rx.
DEBUG(dbgs() << TRI->getName(RC->getRegister(rx)) << ":\t" << CurInstr		DEBUG(dbgs() << TRI->getName(RC->getRegister(rx)) << ":\t" << CurInstr
<< '\t' << *MI);		<< '\t' << *MI);

		if (breakDependency) {
// Check clearance before partial register updates.		// Check clearance before partial register updates.
// Call breakDependence before setting LiveRegs[rx].Def.		// Call breakDependence before setting LiveRegs[rx].Def.
unsigned Pref = TII->getPartialRegUpdateClearance(*MI, i, TRI);		unsigned Pref = TII->getPartialRegUpdateClearance(*MI, i, TRI);
if (Pref && shouldBreakDependence(MI, i, Pref))		if (Pref && shouldBreakDependence(MI, i, Pref))
TII->breakPartialRegDependency(*MI, i, TRI);		TII->breakPartialRegDependency(*MI, i, TRI);
		}

// How many instructions since rx was last written?		// How many instructions since rx was last written?
LiveRegs[rx].Def = CurInstr;		LiveRegs[rx].Def = CurInstr;

// Kill off domains redefined by generic instructions.		// Kill off domains redefined by generic instructions.
if (Kill)		if (Kill)
kill(rx);		kill(rx);
}		}
(175 lines not shown)	for (int rx : regIndices(mo.getReg())) {
if (!LiveRegs[rx].Value || (mo.isDef() && LiveRegs[rx].Value != dv)) {		if (!LiveRegs[rx].Value || (mo.isDef() && LiveRegs[rx].Value != dv)) {
kill(rx);		kill(rx);
setLiveReg(rx, dv);		setLiveReg(rx, dv);
}		}
}		}
}		}
}		}

		void ExeDepsFix::processBasicBlock(MachineBasicBlock *MBB, bool PrimaryPass) {
		enterBasicBlock(MBB);
		// If this block is not done, it makes little sense to make any decisions
		// based on clearance information. We need to make a second pass anyway,
		// and by then we'll have better information, so we can avoid doing the work
		// to try and break dependencies now.
		bool breakDependency = isBlockDone(MBB);
		for (MachineInstr &MI : *MBB) {
		if (!MI.isDebugValue()) {
		bool Kill = false;
		if (PrimaryPass)
		Kill = visitInstr(&MI);
		processDefs(&MI, breakDependency, Kill);
		}
		}
		if (breakDependency)
		processUndefReads(MBB);
		leaveBasicBlock(MBB);
		}

		bool ExeDepsFix::isBlockDone(MachineBasicBlock *MBB) {
		return MBBInfos[MBB].PrimaryCompleted &&
		MBBInfos[MBB].IncomingCompleted == MBBInfos[MBB].PrimaryIncoming &&
		MBBInfos[MBB].IncomingProcessed == MBB->pred_size();
		}

		void ExeDepsFix::updateSuccessors(MachineBasicBlock *MBB, bool Primary) {
		bool Done = isBlockDone(MBB);
		for (auto *Succ : MBB->successors()) {
		if (!isBlockDone(Succ)) {
		if (Primary) {
		MBBInfos[Succ].IncomingProcessed++;
		}
		if (Done) {
		MBBInfos[Succ].IncomingCompleted++;
		}
		if (isBlockDone(Succ)) {
		// Perform secondary processing for this successor. See the big comment
		// in runOnMachineFunction, for an explanation of the iteration order.
		processBasicBlock(Succ, false);
		updateSuccessors(Succ, false);
		}
		}
		}
		}

bool ExeDepsFix::runOnMachineFunction(MachineFunction &mf) {		bool ExeDepsFix::runOnMachineFunction(MachineFunction &mf) {
if (skipFunction(*mf.getFunction()))		if (skipFunction(*mf.getFunction()))
return false;		return false;
MF = &mf;		MF = &mf;
TII = MF->getSubtarget().getInstrInfo();		TII = MF->getSubtarget().getInstrInfo();
TRI = MF->getSubtarget().getRegisterInfo();		TRI = MF->getSubtarget().getRegisterInfo();
RegClassInfo.runOnMachineFunction(mf);		RegClassInfo.runOnMachineFunction(mf);
LiveRegs = nullptr;		LiveRegs = nullptr;
(20 lines not shown)	if (AliasMap.empty()) {
// therefore the LiveRegs array.		// therefore the LiveRegs array.
AliasMap.resize(TRI->getNumRegs());		AliasMap.resize(TRI->getNumRegs());
for (unsigned i = 0, e = RC->getNumRegs(); i != e; ++i)		for (unsigned i = 0, e = RC->getNumRegs(); i != e; ++i)
for (MCRegAliasIterator AI(RC->getRegister(i), TRI, true);		for (MCRegAliasIterator AI(RC->getRegister(i), TRI, true);
AI.isValid(); ++AI)		AI.isValid(); ++AI)
AliasMap[*AI].push_back(i);		AliasMap[*AI].push_back(i);
}		}

		// Initialize the MBBInfos
		for (auto &MBB : mf) {
		MBBInfo InitialInfo;
		MBBInfos.insert(std::make_pair(&MBB, InitialInfo));
		}

		/*
		* We want to visit every instruction in every basic block in order to update
		* its execution domain or break any false dependencies. However, for the
		* dependency breaking, we need to know clearances from all predecessors
		* (including any backedges). One way to do so would be to do two complete
		* passes over all basic blocks/instructions, the first for recording
		* clearances, the second to break the dependencies. However, for functions
		* without backedges, or functions with a lot of straight-line code, and
		* a small loop, that would be a lot of unnecessary work (since only the
		* BBs that are part of the loop require two passes). As an example,
		* consider the following loop.
		*
		*
		* PH -> A -> B (xmm<Undef> -> xmm<Def>) -> C -> D -> EXIT
		*       ^                                  |
		*       +----------------------------------+
		*
		* The iteration order is as follows:
		* Naive: PH A B C D A' B' C' D'
		* Optimized: PH A B C A' B' C' D
		*
		* Note that we avoid processing D twice, because we can entirely process
		* the predecessors before getting to D. We call a block that is ready
		* for its second round of processing `done` (isBlockDone). Once we finish
		* processing some block, we update the counters in MBBInfos and re-process
		* any successors that are now done.
		*/

MachineBasicBlock *Entry = &*MF->begin();		MachineBasicBlock *Entry = &*MF->begin();
ReversePostOrderTraversal<MachineBasicBlock*> RPOT(Entry);		ReversePostOrderTraversal<MachineBasicBlock*> RPOT(Entry);
SmallVector<MachineBasicBlock*, 16> Loops;
for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator		for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator
MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {		MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {
MachineBasicBlock *MBB = *MBBI;		MachineBasicBlock *MBB = *MBBI;
enterBasicBlock(MBB);		// N.B: IncomingProcessed and IncomingCompleted were already updated while
if (SeenUnknownBackEdge)		// processing this block's predecessors.
Loops.push_back(MBB);		MBBInfos[MBB].PrimaryCompleted = true;
for (MachineInstr &MI : *MBB)		MBBInfos[MBB].PrimaryIncoming = MBBInfos[MBB].IncomingProcessed;
visitInstr(&MI);		processBasicBlock(MBB, true);
processUndefReads(MBB);		updateSuccessors(MBB, true);
leaveBasicBlock(MBB);
}		}

// Visit all the loop blocks again in order to merge DomainValues from		// We need to go through again and finalize any blocks that are not done yet.
// back-edges.		// This is possible if blocks have dead predecessors, so we didn't visit them
for (MachineBasicBlock *MBB : Loops) {		// above.
enterBasicBlock(MBB);		for (ReversePostOrderTraversal<MachineBasicBlock *>::rpo_iterator
for (MachineInstr &MI : *MBB)		MBBI = RPOT.begin(),
if (!MI.isDebugValue())		MBBE = RPOT.end();
processDefs(&MI, false);		MBBI != MBBE; ++MBBI) {
processUndefReads(MBB);		MachineBasicBlock *MBB = *MBBI;
leaveBasicBlock(MBB);		if (!isBlockDone(MBB)) {
		processBasicBlock(MBB, false);
		// Don't update successors here. We'll get to them anyway through this
		// loop.
		}
}		}

// Clear the LiveOuts vectors and collapse any remaining DomainValues.		// Clear the LiveOuts vectors and collapse any remaining DomainValues.
for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator		for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator
MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {		MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {
LiveOutMap::const_iterator FI = LiveOuts.find(*MBBI);		auto FI = MBBInfos.find(*MBBI);
if (FI == LiveOuts.end() || !FI->second)		if (FI == MBBInfos.end() || !FI->second.OutRegs)
continue;		continue;
for (unsigned i = 0, e = NumRegs; i != e; ++i)		for (unsigned i = 0, e = NumRegs; i != e; ++i)
if (FI->second[i].Value)		if (FI->second.OutRegs[i].Value)
release(FI->second[i].Value);		release(FI->second.OutRegs[i].Value);
delete[] FI->second;		delete[] FI->second.OutRegs;
}		}
LiveOuts.clear();		MBBInfos.clear();
UndefReads.clear();		UndefReads.clear();
Avail.clear();		Avail.clear();
Allocator.DestroyAll();		Allocator.DestroyAll();

return false;		return false;
}		}

FunctionPass *		FunctionPass *
llvm::createExecutionDependencyFixPass(const TargetRegisterClass *RC) {		llvm::createExecutionDependencyFixPass(const TargetRegisterClass *RC) {
return new ExeDepsFix(RC);		return new ExeDepsFix(RC);
}		}
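
The iteration order described in the comment in runOnMachineFunction can be checked with a small stand-alone simulation. The following is only a sketch under stated assumptions, not the code from this patch: `BlockInfo`, `Cfg`, `visit` and `visitOrder` are hypothetical names, blocks are plain integer indices already listed in reverse post-order, the EXIT block is left out, and the bodies of `isBlockDone` and `updateSuccessors` are reconstructed from the criteria given in the summary rather than copied from the diff. Primary visits are logged by block name; secondary visits get a trailing `'`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-block counters described in the summary.
struct BlockInfo {
  bool PrimaryCompleted = false; // primary pass has visited this block
  int IncomingProcessed = 0;     // predecessors that finished primary processing
  int IncomingCompleted = 0;     // predecessors that are done
  int PrimaryIncoming = 0;       // IncomingProcessed at this block's primary pass
};

// Hypothetical toy CFG: blocks are indices, assumed to be in RPO already.
struct Cfg {
  std::vector<std::string> Names;
  std::vector<std::vector<int>> Succs;
  std::vector<int> NumPreds;
  std::vector<BlockInfo> Info;
  std::string Order; // log of visits, e.g. "PH A B ..."
};

// A block is done once its primary pass ran, all predecessors finished
// primary processing, and every predecessor that had finished primary
// processing when this block was first visited is itself done.
bool isBlockDone(const Cfg &G, int B) {
  const BlockInfo &I = G.Info[B];
  return I.PrimaryCompleted && I.IncomingCompleted == I.PrimaryIncoming &&
         I.IncomingProcessed == G.NumPreds[B];
}

// Stand-in for the real per-block work: just record the visit.
void visit(Cfg &G, int B, bool Primary) {
  if (!G.Order.empty())
    G.Order += ' ';
  G.Order += G.Names[B];
  if (!Primary)
    G.Order += '\'';
}

// Bump the successors' counters and re-process any successor that has just
// become done. Recursive, as discussed in the inline comments.
void updateSuccessors(Cfg &G, int B, bool Primary) {
  const bool Done = isBlockDone(G, B);
  for (int Succ : G.Succs[B]) {
    if (isBlockDone(G, Succ))
      continue;
    if (Primary)
      ++G.Info[Succ].IncomingProcessed;
    if (Done)
      ++G.Info[Succ].IncomingCompleted;
    if (isBlockDone(G, Succ)) {
      visit(G, Succ, /*Primary=*/false);
      updateSuccessors(G, Succ, /*Primary=*/false);
    }
  }
}

// Runs the scheme on PH -> A -> B -> C -> D with a backedge C -> A.
std::string visitOrder() {
  Cfg G;
  G.Names = std::vector<std::string>{"PH", "A", "B", "C", "D"};
  G.Succs = std::vector<std::vector<int>>{{1}, {2}, {3}, {1, 4}, {}};
  G.NumPreds = std::vector<int>{0, 2, 1, 1, 1};
  G.Info.resize(G.Names.size());

  const int N = static_cast<int>(G.Names.size());
  for (int B = 0; B < N; ++B) {
    G.Info[B].PrimaryCompleted = true;
    G.Info[B].PrimaryIncoming = G.Info[B].IncomingProcessed;
    visit(G, B, /*Primary=*/true);
    updateSuccessors(G, B, /*Primary=*/true);
  }
  // Final sweep for blocks that never became done (none in this graph).
  for (int B = 0; B < N; ++B)
    if (!isBlockDone(G, B))
      visit(G, B, /*Primary=*/false);
  return G.Order;
}
```

For this graph, `visitOrder()` yields `PH A B C A' B' C' D`, matching the "Optimized" order in the comment: D is visited only once because C is already done by the time D gets its primary pass.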

llvm/trunk/test/CodeGen/X86/break-false-dep.ll

	[271 lines not shown]
	ret:			ret:
	ret i64 %s2			ret i64 %s2
	;AVX-LABEL:@loopclearence			;AVX-LABEL:@loopclearence
	;Registers 4-7 are not used and therefore one of them should be chosen			;Registers 4-7 are not used and therefore one of them should be chosen
	;AVX-NOT: {{%xmm[4-7]}}			;AVX-NOT: {{%xmm[4-7]}}
	;AVX: vcvtsi2sdq {{.*}}, [[XMM4_7:%xmm[4-7]]], {{%xmm[0-9]+}}			;AVX: vcvtsi2sdq {{.*}}, [[XMM4_7:%xmm[4-7]]], {{%xmm[0-9]+}}
	;AVX-NOT: [[XMM4_7]]			;AVX-NOT: [[XMM4_7]]
	}			}

				; Make sure we are making a smart choice regarding undef registers even for more
				; complicated loop structures. This example is the inner loop from
				; julia> a = falses(10000); a[1:4:end] = true
				; julia> linspace(1.0,2.0,10000)[a]
				define void @loopclearance2(double* nocapture %y, i64* %x, double %c1, double %c2, double %c3, double %c4, i64 %size) {
				entry:
				tail call void asm sideeffect "", "~{xmm7},~{dirflag},~{fpsr},~{flags}"()
				tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()
				tail call void asm sideeffect "", "~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{dirflag},~{fpsr},~{flags}"()
				br label %loop

				loop:
				%phi_i = phi i64 [ 1, %entry ], [ %nexti, %loop_end ]
				%phi_j = phi i64 [ 1, %entry ], [ %nextj, %loop_end ]
				%phi_k = phi i64 [ 0, %entry ], [ %nextk, %loop_end ]
				br label %inner_loop

				inner_loop:
				%phi = phi i64 [ %phi_k, %loop ], [ %nextk, %inner_loop ]
				%idx = lshr i64 %phi, 6
				%inputptr = getelementptr i64, i64* %x, i64 %idx
				%input = load i64, i64* %inputptr, align 8
				%masked = and i64 %phi, 63
				%shiftedmasked = shl i64 1, %masked
				%maskedinput = and i64 %input, %shiftedmasked
				%cmp = icmp eq i64 %maskedinput, 0
				%nextk = add i64 %phi, 1
				br i1 %cmp, label %inner_loop, label %loop_end

				loop_end:
				%nexti = add i64 %phi_i, 1
				%nextj = add i64 %phi_j, 1
				; Register use, plus us clobbering 7-15 above, basically forces xmm6 here as
				; the only reasonable choice. The primary thing we care about is that it's
				; not one of the registers used in the loop (e.g. not the output reg here)
				;AVX-NOT: %xmm6
				;AVX: vcvtsi2sdq {{.*}}, %xmm6, {{%xmm[0-9]+}}
				;AVX-NOT: %xmm6
				%nexti_f = sitofp i64 %nexti to double
				%sub = fsub double %c1, %nexti_f
				%mul = fmul double %sub, %c2
				;AVX: vcvtsi2sdq {{.*}}, %xmm6, {{%xmm[0-9]+}}
				;AVX-NOT: %xmm6
				%phi_f = sitofp i64 %phi to double
				%mul2 = fmul double %phi_f, %c3
				%add2 = fadd double %mul, %mul2
				%div = fdiv double %add2, %c4
				%prev_j = add i64 %phi_j, -1
				%outptr = getelementptr double, double* %y, i64 %prev_j
				store double %div, double* %outptr, align 8
				%done = icmp slt i64 %size, %nexti
				br i1 %done, label %loopdone, label %loop

				loopdone:
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[ExecutionDepsFix] Improve clearance calculation for loops
Closed, Public

Diff 86362

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp

llvm/trunk/test/CodeGen/X86/break-false-dep.ll
