Details

Reviewers

myatsina
MatzeB
atrick
mkuper

Commits

rG578cf7aae755: [ExecutionDepsFix] Improve clearance calculation for loops
rL293571: [ExecutionDepsFix] Improve clearance calculation for loops

Summary

In revision rL278321, ExecutionDepsFix learned how to pick a better
register for undef register reads, e.g. for instructions such as
vcvtsi2sdq. While this revision improved performance on a good number
of our benchmarks, it unfortunately also caused significant regressions
(up to 3x) on others. This regression turned out to be caused by loops
such as:

PH -> A -> B (xmm<Undef> -> xmm<Def>) -> C -> D -> EXIT
      ^                                  |
      +----------------------------------+

In the previous version of the clearance calculation, we would visit
the blocks in order, remembering for each whether there were any
incoming backedges from blocks that we hadn't processed yet and if
so queuing up the block to be re-processed. However, for loop structures
such as the above, this is clearly insufficient, since the block B
does not have any unknown backedges, so we do not see the false
dependency from the previous iteration's Def of xmm registers in B.
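
To make the missed dependency concrete, here is a small arithmetic sketch of how a def index travels around the loop. It mirrors the bookkeeping visible in the diff (defs are rebased by `CurInstr` when leaving a block, merged with `std::max` when entering one, and clearance is `CurInstr - Def`), but the block sizes, the `LongAgo` default, and all function names are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Assumed "never written" default, far in the past (illustrative value).
constexpr int LongAgo = -(1 << 20);

// On block exit, rebase a def index so it is relative to the end of the block.
inline int rebaseAtExit(int Def, int NumInstrs) {
  assert(NumInstrs >= 0 && "block length cannot be negative");
  return Def - NumInstrs;
}

// On block entry, keep the most recent def among the known predecessors.
inline int mergeAtEntry(int Current, int Incoming) {
  return std::max(Current, Incoming);
}

// Number of instructions executed since the register was last written.
inline int clearance(int CurInstr, int Def) { return CurInstr - Def; }

// Def index of the xmm register on entry to B, for hypothetical block sizes:
// B has 10 instructions and writes the register at index 8, C has 3
// instructions, A has 4.
inline int defEnteringB(bool BackEdgeKnown) {
  int FromPH = LongAgo;
  int FromC = rebaseAtExit(rebaseAtExit(/*Def=*/8, /*NumInstrs=*/10), 3);
  int EnterA = BackEdgeKnown ? mergeAtEntry(FromPH, FromC) : FromPH;
  return rebaseAtExit(EnterA, /*NumInstrs=*/4);
}
```

With the back edge taken into account, an undef read at instruction 1 of B sees a clearance of 10. Without it, the register looks as if it had not been written for roughly a million instructions, so breaking the dependency would never be considered.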

To fix this, we need to consider all blocks that are part of the loop
and reprocess them once the correct clearance values are known. As
an optimization, we also want to avoid reprocessing any later blocks
that are not part of the loop.

In summary, the iteration order is as follows:
Before: PH A B C D A'
Corrected (Naive): PH A B C D A' B' C' D'
Corrected (w/ optimization): PH A B C A' B' C' D

To facilitate this optimization we introduce two new counters for each
basic block. The first counts how many of its predecessors have
completed primary processing. The second counts how many of its
predecessors have completed all processing (we will call such a block
*done*). A block is reprocessed once both of the following hold:

1. All predecessors have completed primary processing.
2. For x the number of predecessors that had completed primary processing *at the time of primary processing of this block*, the number of predecessors that are done has reached x.
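
The two counters and the resulting criterion can be sketched as follows. This is a simplified, standalone model rather than the patch itself: the counter names follow `MBBInfo` in the diff, while `BlockInfo`, `NumPreds`, and the free function `isBlockDone` are stand-ins introduced here for illustration.

```cpp
#include <cassert>

// Simplified stand-in for the per-block bookkeeping. NumPreds is a
// hypothetical field that replaces MBB->pred_size() in this sketch.
struct BlockInfo {
  bool PrimaryCompleted = false;  // this block has had its primary pass
  unsigned IncomingProcessed = 0; // preds that finished primary processing
  unsigned PrimaryIncoming = 0;   // IncomingProcessed when our primary pass began
  unsigned IncomingCompleted = 0; // preds that are done
  unsigned NumPreds = 0;          // total number of predecessors
};

// A block is done once (1) every predecessor has had its primary pass and
// (2) as many predecessors are done as had been processed when this block's
// own primary pass started.
inline bool isBlockDone(const BlockInfo &BI) {
  return BI.PrimaryCompleted &&
         BI.IncomingCompleted == BI.PrimaryIncoming &&
         BI.IncomingProcessed == BI.NumPreds;
}
```

For the loop header A in the example (predecessors PH and C), `PrimaryIncoming` is 1 because only PH precedes A in reverse postorder. A therefore becomes done once PH is done and C has had its primary pass.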

The intuition behind this criterion is as follows:
We need to perform primary processing on all predecessors in order to
find any direct defs in those predecessors. When predecessors are
done, we also know that we have information about indirect defs (e.g.
defs in block B that were inherited through B->C->A->B). We cannot
simply wait for all predecessors to be done, since that would
cause cyclic dependencies. It is guaranteed, however, that all those
predecessors that are prior to us in reverse postorder will be done
before us. Since we iterate over the basic blocks in reverse postorder,
the number x above is precisely the number of predecessors
prior to us in reverse postorder.
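
The visiting order claimed in the summary can be checked with a small standalone simulation. This is a sketch under assumptions, not the pass: the CFG is a hand-built model of the example loop, the primary loop in `run` is a reconstruction (that part of the diff is not shown here), and the final sweep over blocks that never become done is left out because the example does not need it. `Simulator`, `Block`, and `exampleOrder` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Per-block state; the counter names follow MBBInfo in the diff.
struct Block {
  std::string Name;
  std::vector<int> Succs;
  unsigned NumPreds = 0;
  bool PrimaryCompleted = false;
  unsigned IncomingProcessed = 0;
  unsigned PrimaryIncoming = 0;
  unsigned IncomingCompleted = 0;
};

struct Simulator {
  std::vector<Block> Blocks; // assumed to be stored in reverse postorder
  std::string Order;         // visit log; reprocessing is marked with '

  int addBlock(const std::string &Name) {
    Blocks.emplace_back();
    Blocks.back().Name = Name;
    return static_cast<int>(Blocks.size()) - 1;
  }

  void addEdge(int From, int To) {
    Blocks[From].Succs.push_back(To);
    ++Blocks[To].NumPreds;
  }

  bool isDone(int B) const {
    const Block &BI = Blocks[B];
    return BI.PrimaryCompleted && BI.IncomingCompleted == BI.PrimaryIncoming &&
           BI.IncomingProcessed == BI.NumPreds;
  }

  void visit(int B, bool Primary) {
    if (!Order.empty())
      Order += ' ';
    Order += Blocks[B].Name;
    if (!Primary)
      Order += '\'';
  }

  // Recursive, like updateSuccessors in this revision of the patch.
  void updateSuccessors(int B, bool Primary, bool Done) {
    for (int S : Blocks[B].Succs) {
      if (isDone(S))
        continue;
      if (Primary)
        ++Blocks[S].IncomingProcessed;
      if (Done)
        ++Blocks[S].IncomingCompleted;
      if (isDone(S)) {
        visit(S, /*Primary=*/false);
        updateSuccessors(S, /*Primary=*/false, /*Done=*/true);
      }
    }
  }

  std::string run() {
    for (int B = 0, E = static_cast<int>(Blocks.size()); B != E; ++B) {
      Blocks[B].PrimaryCompleted = true;
      Blocks[B].PrimaryIncoming = Blocks[B].IncomingProcessed;
      visit(B, /*Primary=*/true);
      updateSuccessors(B, /*Primary=*/true, isDone(B));
    }
    return Order;
  }
};

// The loop from the summary: PH -> A -> B -> C -> D with back edge C -> A.
inline std::string exampleOrder() {
  Simulator S;
  int PH = S.addBlock("PH"), A = S.addBlock("A"), B = S.addBlock("B"),
      C = S.addBlock("C"), D = S.addBlock("D");
  S.addEdge(PH, A);
  S.addEdge(A, B);
  S.addEdge(B, C);
  S.addEdge(C, A);
  S.addEdge(C, D);
  return S.run();
}
```

In this model, A becomes done during the primary pass of C, which triggers A', B' and C' before D receives its only visit.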

Diff Detail

Build Status

Buildable 2952
Build 2952: arc lint + arc unit

Event Timeline

loladiro updated this revision to Diff 84518. Jan 15 2017, 9:32 PM

loladiro retitled this revision from to [ExecutionDepsFix] Improve clearance calculation for loops.

loladiro updated this object.

loladiro added reviewers: MatzeB, myatsina, mkuper, atrick.

loladiro added a subscriber: llvm-commits.

loladiro updated this object. Jan 15 2017, 9:33 PM

vchuravy added a subscriber: vchuravy. Jan 15 2017, 9:48 PM

sanjoy added a subscriber: sanjoy. Jan 15 2017, 9:49 PM

myatsina added inline comments. Jan 16 2017, 8:01 AM

lib/CodeGen/ExecutionDepsFix.cpp
146	Not --> Note ?
413	Can you add an assertion message please?
415	Is OutRegs null when it's a back edge from a BB we haven't seen yet? Are there other cases where it can be null? Please add a comment explaining the cases.
456	Some of the comments in this function need to be updated (we no longer have LiveOuts, we always change the defs to be relative to the end of the block, etc.)
571	For readability purposes, how about changing "BlockDone" to "breakDependency" and adding your comment regarding done blocks before the call to processDefs?
	processDefs() will look like this:
	  if (breakDependency) { // calc Pref ... }
	and processBasicBlock will look like this:
	  // If this block is not done, it makes little sense ...
	  bool breakDependency = isBlockDone(Done)
	  processDefs(MI, breakDependency, ...)
792	Do you need "Done" here? I don't see you using it.
802	Is there a point going over it when we're not isBlockDone? processDefs pushes instructions into the undef reads only when Done = true. processUndefReads does break if the undef reads are empty, but perhaps for readability it's worth writing this explicitly.
819	Why not check isBasicBlockDone on MBB?
822	At first glance, it wasn't clear to me why you need it here and why you can't just do one "primary" pass and then process the basic blocks that are still not done. If I understood correctly, you're doing it here for the optimization you've talked about (Making sure the order is: PH A B C A' B' C' D). Am I right? I would add a comment here elaborating on that. I would even consider adding a comment somewhere with the loop example from the description of this patch and the "optimized" order in which the algorithm visits the nodes. I think it is a great example and it will make the traversal order here much clearer.
868	Better use a constructor with default initialization like DomainValue does.
878	I would add a comment that IncomingProcessed and IncomingCompleted of this block were already updated during the processing of predecessor blocks.
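
One possible shape for the change requested in the comment on line 868 is a default constructor that zero-initializes every field. The sketch below is an illustration only and does not claim to be what was committed; `LiveReg` is forward-declared as an opaque stand-in so that the snippet compiles on its own.

```cpp
#include <cassert>

struct LiveReg; // opaque stand-in; the real type is defined by the pass

struct MBBInfo {
  LiveReg *OutRegs;
  bool PrimaryCompleted;
  unsigned IncomingProcessed;
  unsigned PrimaryIncoming;
  unsigned IncomingCompleted;

  // Every block starts out unvisited, with no predecessor accounted for.
  MBBInfo()
      : OutRegs(nullptr), PrimaryCompleted(false), IncomingProcessed(0),
        PrimaryIncoming(0), IncomingCompleted(0) {}
};
```

With such a constructor, the initialization loop could insert `MBBInfo()` instead of spelling out `{nullptr, false, 0, 0, 0}`, so adding a field later cannot silently shift the positional values.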

loladiro mentioned this in D28786: [ExecutionDepsFix] Kill clearance at function entry/calls. Jan 16 2017, 3:49 PM

loladiro added a child revision: D28786: [ExecutionDepsFix] Kill clearance at function entry/calls. Jan 16 2017, 3:49 PM

Address review comments

Fix small typo in assertion message

lib/CodeGen/ExecutionDepsFix.cpp
415	Yes, that's correct. If it's null it's a back edge. Will add a comment to this effect.
819	Just to avoid doing the calculation twice. Probably premature optimization. Will simplify.
822	> If I understood correctly, you're doing it here for the optimization you've talked about (Making sure the order is: PH A B C A' B' C' D). Am I right?
	Yes, that's correct. It's discussed below when going over the blocks that are not done. I'll add a small comment here. I will add the loop example from the commit message below and add a small comment here pointing people to the discussion below.

Forgot to run clang-format

loladiro added a child revision: D28915: [ExecutionDepsFix] Optimize instruction insertion. Jan 19 2017, 1:23 PM

Added a few minor comments.

LGTM once they are addressed.

lib/CodeGen/ExecutionDepsFix.cpp
415	Don't forget to add the comment :)
test/CodeGen/X86/break-false-dep.ll
313	xmm7 --> xmm6 ?

This revision is now accepted and ready to land. Jan 24 2017, 6:25 AM

Closed by commit rL293571: [ExecutionDepsFix] Improve clearance calculation for loops (authored by kfischer). Jan 30 2017, 3:48 PM

This revision was automatically updated to reflect the committed changes.

mehdi_amini added a subscriber: bruno. Mar 7 2017, 10:45 PM

mehdi_amini added a subscriber: mehdi_amini. Mar 7 2017, 10:47 PM

mehdi_amini added inline comments.

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837 ↗	(On Diff #86362)	What is the limit on the depth of the stack? We're seeing a crash because of stack explosion here, so I fear it can grow with the CFG (which wouldn't seem reasonable to me). Can you comment on this?

mehdi_amini added inline comments. Mar 7 2017, 10:49 PM

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837 ↗	(On Diff #86362)	Note: I haven't spent time figuring out what `ExeDepsFix` is doing, don't assume I have any context. The crash we're tracking is a ThinLTO bootstrap failure, we're still working on the exact reproducer.

loladiro added inline comments. Mar 7 2017, 10:56 PM

llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837 ↗	(On Diff #86362)	Yes, I suppose it can grow with the number of nested loops. Must be quite a reproducer to cause this problem though. Unless of course there's something more fundamental wrong with the logic here (though for that I'd need the reproducer). In any case, it would be fine to change this to keep a working set in a SmallVector or something equivalent.

> Yes, I suppose it can grow with the number of nested loops. Must be quite a reproducer to cause this problem though. Unless of course there's something more fundamental wrong with the logic here (though for that I'd need the reproducer). In any case, it would be fine to change this to keep a working set in a SmallVector or something equivalent.

This is a recursion over the control flow graph and you only need a single loop. Recursing over the structure of the program is a no-go in a compiler as your stack is always limited and comparatively small and the input can grow arbitrarily.
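
A minimal sketch of the single-loop shape suggested here, using an explicit worklist in place of recursion so that stack depth no longer depends on the CFG. It is a standalone model and not the code from D31681: blocks are plain indices assumed to be in reverse postorder, and `Traversal`, `Block`, and `NumReprocessed` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Block {
  std::vector<int> Succs;
  unsigned NumPreds = 0;
  bool PrimaryCompleted = false;
  unsigned IncomingProcessed = 0;
  unsigned PrimaryIncoming = 0;
  unsigned IncomingCompleted = 0;
};

struct Traversal {
  std::vector<Block> Blocks;   // index order is assumed to be reverse postorder
  std::string Order;           // visit log, e.g. "0 1 1'"
  unsigned NumReprocessed = 0; // number of non-primary visits

  explicit Traversal(int NumBlocks) : Blocks(NumBlocks) {}

  void addEdge(int From, int To) {
    Blocks[From].Succs.push_back(To);
    ++Blocks[To].NumPreds;
  }

  bool isDone(int B) const {
    const Block &BI = Blocks[B];
    return BI.PrimaryCompleted && BI.IncomingCompleted == BI.PrimaryIncoming &&
           BI.IncomingProcessed == BI.NumPreds;
  }

  void visit(int B, bool Primary) {
    if (!Order.empty())
      Order += ' ';
    Order += std::to_string(B);
    if (!Primary) {
      Order += '\'';
      ++NumReprocessed;
    }
  }

  std::string run() {
    std::vector<int> Worklist; // heap-backed, so deep chains cost no stack
    for (int B = 0, E = static_cast<int>(Blocks.size()); B != E; ++B) {
      Blocks[B].PrimaryCompleted = true;
      Blocks[B].PrimaryIncoming = Blocks[B].IncomingProcessed;
      bool Primary = true;
      Worklist.push_back(B);
      while (!Worklist.empty()) {
        int Active = Worklist.back();
        Worklist.pop_back();
        bool Done = isDone(Active);
        visit(Active, Primary);
        for (int S : Blocks[Active].Succs) {
          if (isDone(S))
            continue;
          if (Primary)
            ++Blocks[S].IncomingProcessed;
          if (Done)
            ++Blocks[S].IncomingCompleted;
          // A block is queued only on its transition to done, so at most once.
          if (isDone(S))
            Worklist.push_back(S);
        }
        Primary = false; // everything after the first pop is reprocessing
      }
    }
    return Order;
  }
};
```

On the loop from the summary this produces the same order as the recursive version, and a loop of a few hundred thousand blocks is handled with constant stack depth, since pending work lives in the vector.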

We are seeing this in a real stage2 build of clang and we have to fix the buildbot! If you need to have a reproducer update llvm to r298184, limit your stack to 512kb (ulimit -s 512) as that is what is currently used by ThinLTO and use this python snippet to make a reproducer:

n_blocks=4000

print '''
---
name: func
tracksRegLiveness: true
body: |
  bb.0:
    successors: %bb.1, %bb.{n_blocks}
    liveins: %xmm0
    NOOP implicit %xmm0
    JE_1 %bb.{n_blocks}, implicit undef %eflags
    JMP_1 %bb.1
'''.format(**locals())

for i in range(1, n_blocks):
    print '  bb.%s:' % (i)
    if i < n_blocks-1:
        print '    successors: %%bb.%s, %%bb.%s' % (i+1, n_blocks)
        print '    JE_1 %%bb.%s, implicit undef %%eflags' % (i+1)
    else:
        print '    successors: %%bb.%s' % (n_blocks)
    print '    JMP_1 %%bb.%s' % n_blocks

print '''
  bb.{n_blocks}:
    RETQ undef %eax
'''.format(**locals())

In D28759#704582, @MatzeB wrote:

We are seeing this in a real stage2 build of clang and we have to fix the buildbot!

FYI this is the failing job right now : http://green.lab.llvm.org/green/view/Clang/job/clang-stage2-Rthinlto/

Can you try https://reviews.llvm.org/D31681? I wasn't able to reproduce the problem with your example, it kept crashing other parts of the compiler ;).

In D28759#718506, @loladiro wrote:

Can you try https://reviews.llvm.org/D31681? I wasn't able to reproduce the problem with your example, it kept crashing other parts of the compiler ;).

I think after we saw the stackoverflow in the pass on the build I did all my further testing with "llc -run-pass=x86-execution-deps-fix" only so I didn't see the other passes. Anyway the build seems to be back to green now. Thanks!

Diff 84518

lib/CodeGen/ExecutionDepsFix.cpp

[136 lines hidden; context: class ExeDepsFix : public MachineFunctionPass {]
const TargetRegisterClass *const RC;		const TargetRegisterClass *const RC;
MachineFunction *MF;		MachineFunction *MF;
const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
RegisterClassInfo RegClassInfo;		RegisterClassInfo RegClassInfo;
std::vector<SmallVector<int, 1>> AliasMap;		std::vector<SmallVector<int, 1>> AliasMap;
const unsigned NumRegs;		const unsigned NumRegs;
LiveReg *LiveRegs;		LiveReg *LiveRegs;
typedef DenseMap<MachineBasicBlock*, LiveReg*> LiveOutMap;		struct MBBInfo {
LiveOutMap LiveOuts;		// Keeps clearance and domain information for all registers. Not that this
		// is different from the usual definition notion of liveness. The CPU
		// doesn't care whether or not we consider a register killed.
		LiveReg *OutRegs;

		// Whether we have gotten to this block in primary processing yet.
		bool PrimaryCompleted;

		// The number of predecessors for which primary processing has completed
		unsigned IncomingProcessed;

		// The value of `IncomingProcessed` at the start of primary processing
		unsigned PrimaryIncoming;

		// The number of predecessors for which all processing steps are done.
		unsigned IncomingCompleted;
		};
		typedef DenseMap<MachineBasicBlock *, MBBInfo> MBBInfoMap;
		MBBInfoMap MBBInfos;

/// List of undefined register reads in this block in forward order.		/// List of undefined register reads in this block in forward order.
std::vector<std::pair<MachineInstr*, unsigned> > UndefReads;		std::vector<std::pair<MachineInstr*, unsigned> > UndefReads;

/// Storage for register unit liveness.		/// Storage for register unit liveness.
LivePhysRegs LiveRegSet;		LivePhysRegs LiveRegSet;

/// Current instruction number.		/// Current instruction number.
/// The first instruction in each basic block is 0.		/// The first instruction in each basic block is 0.
int CurInstr;		int CurInstr;

/// True when the current block has a predecessor that hasn't been visited
/// yet.
bool SeenUnknownBackEdge;

public:		public:
ExeDepsFix(const TargetRegisterClass *rc)		ExeDepsFix(const TargetRegisterClass *rc)
: MachineFunctionPass(ID), RC(rc), NumRegs(RC->getNumRegs()) {}		: MachineFunctionPass(ID), RC(rc), NumRegs(RC->getNumRegs()) {}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesAll();		AU.setPreservesAll();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

MachineFunctionProperties getRequiredProperties() const override {		MachineFunctionProperties getRequiredProperties() const override {
return MachineFunctionProperties().set(		return MachineFunctionProperties().set(
MachineFunctionProperties::Property::NoVRegs);		MachineFunctionProperties::Property::NoVRegs);
}		}

StringRef getPassName() const override { return "Execution dependency fix"; }		StringRef getPassName() const override { return "Execution dependency fix"; }

private:		private:
iterator_range<SmallVectorImpl<int>::const_iterator>		iterator_range<SmallVectorImpl<int>::const_iterator>
regIndices(unsigned Reg) const;		regIndices(unsigned Reg) const;

// DomainValue allocation.		// DomainValue allocation.
DomainValue *alloc(int domain = -1);		DomainValue *alloc(int domain = -1);
DomainValue *retain(DomainValue *DV) {		DomainValue *retain(DomainValue *DV) {
if (DV) ++DV->Refs;		if (DV) ++DV->Refs;
return DV;		return DV;
}		}
void release(DomainValue*);		void release(DomainValue*);
DomainValue *resolve(DomainValue*&);		DomainValue *resolve(DomainValue*&);

// LiveRegs manipulations.		// LiveRegs manipulations.
void setLiveReg(int rx, DomainValue *DV);		void setLiveReg(int rx, DomainValue *DV);
void kill(int rx);		void kill(int rx);
void force(int rx, unsigned domain);		void force(int rx, unsigned domain);
void collapse(DomainValue *dv, unsigned domain);		void collapse(DomainValue *dv, unsigned domain);
bool merge(DomainValue *A, DomainValue *B);		bool merge(DomainValue *A, DomainValue *B);

void enterBasicBlock(MachineBasicBlock*);		void enterBasicBlock(MachineBasicBlock*);
void leaveBasicBlock(MachineBasicBlock*);		void leaveBasicBlock(MachineBasicBlock*);
void visitInstr(MachineInstr*);		bool isBlockDone(MachineBasicBlock *);
void processDefs(MachineInstr*, bool Kill);		void processBasicBlock(MachineBasicBlock *MBB, bool PrimaryPass, bool Done);
		void updateSuccessors(MachineBasicBlock *MBB, bool Primary, bool Done);
		bool visitInstr(MachineInstr *);
		void processDefs(MachineInstr *, bool BlockDone, bool Kill);
void visitSoftInstr(MachineInstr*, unsigned mask);		void visitSoftInstr(MachineInstr*, unsigned mask);
void visitHardInstr(MachineInstr*, unsigned domain);		void visitHardInstr(MachineInstr*, unsigned domain);
void pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,		void pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,
unsigned Pref);		unsigned Pref);
bool shouldBreakDependence(MachineInstr*, unsigned OpIdx, unsigned Pref);		bool shouldBreakDependence(MachineInstr*, unsigned OpIdx, unsigned Pref);
void processUndefReads(MachineBasicBlock*);		void processUndefReads(MachineBasicBlock*);
};		};
}		}
[143 lines hidden; context: for (unsigned rx = 0; rx != NumRegs; ++rx) {]
if (LiveRegs[rx].Value == B)		if (LiveRegs[rx].Value == B)
setLiveReg(rx, A);		setLiveReg(rx, A);
}		}
return true;		return true;
}		}

/// Set up LiveRegs by merging predecessor live-out values.		/// Set up LiveRegs by merging predecessor live-out values.
void ExeDepsFix::enterBasicBlock(MachineBasicBlock *MBB) {		void ExeDepsFix::enterBasicBlock(MachineBasicBlock *MBB) {
// Detect back-edges from predecessors we haven't processed yet.
SeenUnknownBackEdge = false;

// Reset instruction counter in each basic block.		// Reset instruction counter in each basic block.
CurInstr = 0;		CurInstr = 0;

// Set up UndefReads to track undefined register reads.		// Set up UndefReads to track undefined register reads.
UndefReads.clear();		UndefReads.clear();
LiveRegSet.clear();		LiveRegSet.clear();

// Set up LiveRegs to represent registers entering MBB.		// Set up LiveRegs to represent registers entering MBB.
[18 lines hidden; context: if (MBB->pred_empty()) {]
}		}
DEBUG(dbgs() << "BB#" << MBB->getNumber() << ": entry\n");		DEBUG(dbgs() << "BB#" << MBB->getNumber() << ": entry\n");
return;		return;
}		}

// Try to coalesce live-out registers from predecessors.		// Try to coalesce live-out registers from predecessors.
for (MachineBasicBlock::const_pred_iterator pi = MBB->pred_begin(),		for (MachineBasicBlock::const_pred_iterator pi = MBB->pred_begin(),
pe = MBB->pred_end(); pi != pe; ++pi) {		pe = MBB->pred_end(); pi != pe; ++pi) {
LiveOutMap::const_iterator fi = LiveOuts.find(*pi);		auto fi = MBBInfos.find(*pi);
if (fi == LiveOuts.end()) {		assert(fi != MBBInfos.end());
SeenUnknownBackEdge = true;		LiveReg *Incoming = fi->second.OutRegs;
		if (Incoming == nullptr) {
continue;		continue;
}		}
assert(fi->second && "Can't have NULL entries");

for (unsigned rx = 0; rx != NumRegs; ++rx) {		for (unsigned rx = 0; rx != NumRegs; ++rx) {
// Use the most recent predecessor def for each register.		// Use the most recent predecessor def for each register.
LiveRegs[rx].Def = std::max(LiveRegs[rx].Def, fi->second[rx].Def);		LiveRegs[rx].Def = std::max(LiveRegs[rx].Def, Incoming[rx].Def);

DomainValue *pdv = resolve(fi->second[rx].Value);		DomainValue *pdv = resolve(Incoming[rx].Value);
if (!pdv)		if (!pdv)
continue;		continue;
if (!LiveRegs[rx].Value) {		if (!LiveRegs[rx].Value) {
setLiveReg(rx, pdv);		setLiveReg(rx, pdv);
continue;		continue;
}		}

// We have a live DomainValue from more than one predecessor.		// We have a live DomainValue from more than one predecessor.
if (LiveRegs[rx].Value->isCollapsed()) {		if (LiveRegs[rx].Value->isCollapsed()) {
// We are already collapsed, but predecessor is not. Force it.		// We are already collapsed, but predecessor is not. Force it.
unsigned Domain = LiveRegs[rx].Value->getFirstDomain();		unsigned Domain = LiveRegs[rx].Value->getFirstDomain();
if (!pdv->isCollapsed() && pdv->hasDomain(Domain))		if (!pdv->isCollapsed() && pdv->hasDomain(Domain))
collapse(pdv, Domain);		collapse(pdv, Domain);
continue;		continue;
}		}

// Currently open, merge in predecessor.		// Currently open, merge in predecessor.
if (!pdv->isCollapsed())		if (!pdv->isCollapsed())
merge(LiveRegs[rx].Value, pdv);		merge(LiveRegs[rx].Value, pdv);
else		else
force(rx, pdv->getFirstDomain());		force(rx, pdv->getFirstDomain());
}		}
}		}
DEBUG(dbgs() << "BB#" << MBB->getNumber()		DEBUG(
<< (SeenUnknownBackEdge ? ": incomplete\n" : ": all preds known\n"));		dbgs() << "BB#" << MBB->getNumber()
		<< (!isBlockDone(MBB) ? ": incomplete\n" : ": all preds known\n"));
}		}

void ExeDepsFix::leaveBasicBlock(MachineBasicBlock *MBB) {		void ExeDepsFix::leaveBasicBlock(MachineBasicBlock *MBB) {
assert(LiveRegs && "Must enter basic block first.");		assert(LiveRegs && "Must enter basic block first.");
		LiveReg *OldOutRegs = MBBInfos[MBB].OutRegs;
// Save live registers at end of MBB - used by enterBasicBlock().		// Save live registers at end of MBB - used by enterBasicBlock().
// Also use LiveOuts as a visited set to detect back-edges.		// Also use LiveOuts as a visited set to detect back-edges.
bool First = LiveOuts.insert(std::make_pair(MBB, LiveRegs)).second;		MBBInfos[MBB].OutRegs = LiveRegs;

if (First) {
// LiveRegs was inserted in LiveOuts. Adjust all defs to be relative to		// LiveRegs was inserted in LiveOuts. Adjust all defs to be relative to
// the end of this block instead of the beginning.		// the end of this block instead of the beginning.
for (unsigned i = 0, e = NumRegs; i != e; ++i)		for (unsigned i = 0, e = NumRegs; i != e; ++i)
LiveRegs[i].Def -= CurInstr;		LiveRegs[i].Def -= CurInstr;
} else {		if (OldOutRegs) {
// Insertion failed, this must be the second pass.		// This must be the second pass.
// Release all the DomainValues instead of keeping them.		// Release all the DomainValues instead of keeping them.
for (unsigned i = 0, e = NumRegs; i != e; ++i)		for (unsigned i = 0, e = NumRegs; i != e; ++i)
release(LiveRegs[i].Value);		release(OldOutRegs[i].Value);
delete[] LiveRegs;		delete[] OldOutRegs;
}		}
LiveRegs = nullptr;		LiveRegs = nullptr;
}		}

void ExeDepsFix::visitInstr(MachineInstr *MI) {		bool ExeDepsFix::visitInstr(MachineInstr *MI) {
if (MI->isDebugValue())
return;

// Update instructions with explicit execution domains.		// Update instructions with explicit execution domains.
std::pair<uint16_t, uint16_t> DomP = TII->getExecutionDomain(*MI);		std::pair<uint16_t, uint16_t> DomP = TII->getExecutionDomain(*MI);
if (DomP.first) {		if (DomP.first) {
if (DomP.second)		if (DomP.second)
visitSoftInstr(MI, DomP.second);		visitSoftInstr(MI, DomP.second);
else		else
visitHardInstr(MI, DomP.first);		visitHardInstr(MI, DomP.first);
}		}

// Process defs to track register ages, and kill values clobbered by generic		return !DomP.first;
// instructions.
processDefs(MI, !DomP.first);
}		}

/// \brief Helps avoid false dependencies on undef registers by updating the		/// \brief Helps avoid false dependencies on undef registers by updating the
/// machine instructions' undef operand to use a register that the instruction		/// machine instructions' undef operand to use a register that the instruction
/// is truly dependent on, or use a register with clearance higher than Pref.		/// is truly dependent on, or use a register with clearance higher than Pref.
void ExeDepsFix::pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,		void ExeDepsFix::pickBestRegisterForUndef(MachineInstr *MI, unsigned OpIdx,
unsigned Pref) {		unsigned Pref) {
MachineOperand &MO = MI->getOperand(OpIdx);		MachineOperand &MO = MI->getOperand(OpIdx);
[53 lines hidden; context: bool ExeDepsFix::shouldBreakDependence(MachineInstr *MI, unsigned OpIdx,]
for (int rx : regIndices(reg)) {		for (int rx : regIndices(reg)) {
unsigned Clearance = CurInstr - LiveRegs[rx].Def;		unsigned Clearance = CurInstr - LiveRegs[rx].Def;
DEBUG(dbgs() << "Clearance: " << Clearance << ", want " << Pref);		DEBUG(dbgs() << "Clearance: " << Clearance << ", want " << Pref);

if (Pref > Clearance) {		if (Pref > Clearance) {
DEBUG(dbgs() << ": Break dependency.\n");		DEBUG(dbgs() << ": Break dependency.\n");
continue;		continue;
}		}
// The current clearance seems OK, but we may be ignoring a def from a
// back-edge.
if (!SeenUnknownBackEdge || Pref <= unsigned(CurInstr)) {
DEBUG(dbgs() << ": OK .\n");		DEBUG(dbgs() << ": OK .\n");
return false;		return false;
}		}
// A def from an unprocessed back-edge may make us break this dependency.
DEBUG(dbgs() << ": Wait for back-edge to resolve.\n");
return false;
}
return true;		return true;
}		}

// Update def-ages for registers defined by MI.		// Update def-ages for registers defined by MI.
// If Kill is set, also kill off DomainValues clobbered by the defs.		// If Kill is set, also kill off DomainValues clobbered by the defs.
//		//
// Also break dependencies on partial defs and undef uses.		// Also break dependencies on partial defs and undef uses.
void ExeDepsFix::processDefs(MachineInstr *MI, bool Kill) {		void ExeDepsFix::processDefs(MachineInstr *MI, bool BlockDone, bool Kill) {
assert(!MI->isDebugValue() && "Won't process debug values");		assert(!MI->isDebugValue() && "Won't process debug values");

// Break dependence on undef uses. Do this before updating LiveRegs below.		// Break dependence on undef uses. Do this before updating LiveRegs below.
unsigned OpNum;		unsigned OpNum;
		// If this block is not done, it makes little sense to make any decisions
		// based on clearance information. We need to make a second pass anyway,
		// and by then we'll have better information, so we can avoid this work now.
		if (BlockDone) {
unsigned Pref = TII->getUndefRegClearance(*MI, OpNum, TRI);		unsigned Pref = TII->getUndefRegClearance(*MI, OpNum, TRI);
if (Pref) {		if (Pref) {
pickBestRegisterForUndef(MI, OpNum, Pref);		pickBestRegisterForUndef(MI, OpNum, Pref);
if (shouldBreakDependence(MI, OpNum, Pref))		if (shouldBreakDependence(MI, OpNum, Pref))
UndefReads.push_back(std::make_pair(MI, OpNum));		UndefReads.push_back(std::make_pair(MI, OpNum));
}		}
		}
const MCInstrDesc &MCID = MI->getDesc();		const MCInstrDesc &MCID = MI->getDesc();
for (unsigned i = 0,		for (unsigned i = 0,
e = MI->isVariadic() ? MI->getNumOperands() : MCID.getNumDefs();		e = MI->isVariadic() ? MI->getNumOperands() : MCID.getNumDefs();
i != e; ++i) {		i != e; ++i) {
MachineOperand &MO = MI->getOperand(i);		MachineOperand &MO = MI->getOperand(i);
if (!MO.isReg())		if (!MO.isReg())
continue;		continue;
if (MO.isUse())		if (MO.isUse())
continue;		continue;
for (int rx : regIndices(MO.getReg())) {		for (int rx : regIndices(MO.getReg())) {
// This instruction explicitly defines rx.		// This instruction explicitly defines rx.
DEBUG(dbgs() << TRI->getName(RC->getRegister(rx)) << ":\t" << CurInstr		DEBUG(dbgs() << TRI->getName(RC->getRegister(rx)) << ":\t" << CurInstr
<< '\t' << *MI);		<< '\t' << *MI);

		if (BlockDone) {
// Check clearance before partial register updates.		// Check clearance before partial register updates.
// Call breakDependence before setting LiveRegs[rx].Def.		// Call breakDependence before setting LiveRegs[rx].Def.
unsigned Pref = TII->getPartialRegUpdateClearance(*MI, i, TRI);		unsigned Pref = TII->getPartialRegUpdateClearance(*MI, i, TRI);
if (Pref && shouldBreakDependence(MI, i, Pref))		if (Pref && shouldBreakDependence(MI, i, Pref))
TII->breakPartialRegDependency(*MI, i, TRI);		TII->breakPartialRegDependency(*MI, i, TRI);
		}

// How many instructions since rx was last written?		// How many instructions since rx was last written?
LiveRegs[rx].Def = CurInstr;		LiveRegs[rx].Def = CurInstr;

// Kill off domains redefined by generic instructions.		// Kill off domains redefined by generic instructions.
if (Kill)		if (Kill)
kill(rx);		kill(rx);
}		}
[175 lines hidden; context: for (int rx : regIndices(mo.getReg())) {]
if (!LiveRegs[rx].Value || (mo.isDef() && LiveRegs[rx].Value != dv)) {		if (!LiveRegs[rx].Value || (mo.isDef() && LiveRegs[rx].Value != dv)) {
kill(rx);		kill(rx);
setLiveReg(rx, dv);		setLiveReg(rx, dv);
}		}
}		}
}		}
}		}

		void ExeDepsFix::processBasicBlock(MachineBasicBlock *MBB, bool PrimaryPass,
		bool Done) {
		enterBasicBlock(MBB);
		for (MachineInstr &MI : *MBB) {
		if (!MI.isDebugValue()) {
		bool Kill = false;
		if (PrimaryPass)
		Kill = visitInstr(&MI);
		processDefs(&MI, isBlockDone(MBB), Kill);
		}
		}
		processUndefReads(MBB);
		leaveBasicBlock(MBB);
		}

		bool ExeDepsFix::isBlockDone(MachineBasicBlock *MBB) {
		return MBBInfos[MBB].PrimaryCompleted &&
		MBBInfos[MBB].IncomingCompleted == MBBInfos[MBB].PrimaryIncoming &&
		MBBInfos[MBB].IncomingProcessed == MBB->pred_size();
		}

		void ExeDepsFix::updateSuccessors(MachineBasicBlock *MBB, bool Primary,
		bool Done) {
		for (auto *Succ : MBB->successors()) {
		if (!isBlockDone(Succ)) {
		if (Primary) {
		MBBInfos[Succ].IncomingProcessed++;
		}
		if (Done) {
		MBBInfos[Succ].IncomingCompleted++;
		}
		if (isBlockDone(Succ)) {
		processBasicBlock(Succ, false, true);
		updateSuccessors(Succ, false, true);
		}
		}
		}
		}
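
To check the visiting order claimed in the summary, the counter scheme can be replayed outside of LLVM. The following is a minimal standalone sketch, not code from the patch: `BlockInfo`, `Sim`, and `visitOrder` are hypothetical names, blocks are plain strings, and the `isBlockDone` test is one reading of the two criteria in the summary (the patch's own `isBlockDone` is not shown in this excerpt). It uses the example CFG PH -> A -> B -> C -> D with the backedge C -> A.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-block bookkeeping described in the summary.
struct BlockInfo {
  std::vector<std::string> Succs;
  unsigned NumPreds = 0;
  bool PrimaryCompleted = false;   // primary pass over this block has run
  unsigned IncomingProcessed = 0;  // predecessors that finished primary processing
  unsigned PrimaryIncoming = 0;    // IncomingProcessed at the time of our primary pass
  unsigned IncomingCompleted = 0;  // predecessors that are done
};

struct Sim {
  std::map<std::string, BlockInfo> Blocks;
  std::string Order;  // visiting order; a trailing ' marks a reprocessing visit

  void addEdge(const std::string &From, const std::string &To) {
    Blocks[From].Succs.push_back(To);
    Blocks[To].NumPreds++;
  }

  // Assumed done test: primary pass finished, every predecessor has been
  // through primary processing, and the predecessors known at primary time
  // are all done.
  bool isBlockDone(const std::string &B) {
    const BlockInfo &I = Blocks[B];
    return I.PrimaryCompleted && I.IncomingCompleted == I.PrimaryIncoming &&
           I.IncomingProcessed == I.NumPreds;
  }

  void visit(const std::string &B, bool Primary) {
    if (!Order.empty())
      Order += ' ';
    Order += Primary ? B : B + "'";
  }

  // Mirrors the shape of updateSuccessors in the diff above.
  void updateSuccessors(const std::string &B, bool Primary, bool Done) {
    const std::vector<std::string> Succs = Blocks[B].Succs;
    for (const std::string &Succ : Succs) {
      if (isBlockDone(Succ))
        continue;
      if (Primary)
        Blocks[Succ].IncomingProcessed++;
      if (Done)
        Blocks[Succ].IncomingCompleted++;
      if (isBlockDone(Succ)) {
        visit(Succ, /*Primary=*/false);  // reprocess with final clearance values
        updateSuccessors(Succ, false, true);
      }
    }
  }
};

std::string visitOrder() {
  Sim S;
  S.addEdge("PH", "A");
  S.addEdge("A", "B");
  S.addEdge("B", "C");
  S.addEdge("C", "D");
  S.addEdge("C", "A");  // loop backedge
  const std::vector<std::string> RPO = {"PH", "A", "B", "C", "D"};
  for (const std::string &B : RPO) {
    BlockInfo &I = S.Blocks[B];
    I.PrimaryCompleted = true;
    I.PrimaryIncoming = I.IncomingProcessed;
    bool PrimaryDone = S.isBlockDone(B);
    S.visit(B, /*Primary=*/true);
    S.updateSuccessors(B, true, PrimaryDone);
  }
  return S.Order;
}
```

Under these assumptions, `visitOrder()` returns `PH A B C A' B' C' D`, which is the "w/ optimization" order from the summary: the loop blocks are reprocessed as soon as C's primary pass completes the backedge into A, and D is visited only once.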

bool ExeDepsFix::runOnMachineFunction(MachineFunction &mf) {		bool ExeDepsFix::runOnMachineFunction(MachineFunction &mf) {
if (skipFunction(*mf.getFunction()))		if (skipFunction(*mf.getFunction()))
return false;		return false;
MF = &mf;		MF = &mf;
TII = MF->getSubtarget().getInstrInfo();		TII = MF->getSubtarget().getInstrInfo();
TRI = MF->getSubtarget().getRegisterInfo();		TRI = MF->getSubtarget().getRegisterInfo();
RegClassInfo.runOnMachineFunction(mf);		RegClassInfo.runOnMachineFunction(mf);
LiveRegs = nullptr;		LiveRegs = nullptr;
Show All 20 Lines	if (AliasMap.empty()) {
// therefore the LiveRegs array.		// therefore the LiveRegs array.
AliasMap.resize(TRI->getNumRegs());		AliasMap.resize(TRI->getNumRegs());
for (unsigned i = 0, e = RC->getNumRegs(); i != e; ++i)		for (unsigned i = 0, e = RC->getNumRegs(); i != e; ++i)
for (MCRegAliasIterator AI(RC->getRegister(i), TRI, true);		for (MCRegAliasIterator AI(RC->getRegister(i), TRI, true);
AI.isValid(); ++AI)		AI.isValid(); ++AI)
AliasMap[*AI].push_back(i);		AliasMap[*AI].push_back(i);
}		}

		// Initialize the MBBInfos
		for (auto &MBB : mf) {
		MBBInfo InitialInfo{nullptr, false, 0, 0, 0};
		myatsina: Better use a constructor with default initialization like DomainValue does.
		MBBInfos.insert(std::make_pair(&MBB, InitialInfo));
		}

MachineBasicBlock *Entry = &*MF->begin();		MachineBasicBlock *Entry = &*MF->begin();
ReversePostOrderTraversal<MachineBasicBlock*> RPOT(Entry);		ReversePostOrderTraversal<MachineBasicBlock*> RPOT(Entry);
SmallVector<MachineBasicBlock*, 16> Loops;
for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator		for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator
MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {		MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {
MachineBasicBlock *MBB = *MBBI;		MachineBasicBlock *MBB = *MBBI;
enterBasicBlock(MBB);		MBBInfos[MBB].PrimaryCompleted = true;
if (SeenUnknownBackEdge)		MBBInfos[MBB].PrimaryIncoming = MBBInfos[MBB].IncomingProcessed;
		myatsina: I would add a comment that IncomingProcessed and IncomingCompleted of this block were already updated during the processing of predecessor blocks.
Loops.push_back(MBB);		bool PrimaryDone = isBlockDone(MBB);
for (MachineInstr &MI : *MBB)		processBasicBlock(MBB, true, PrimaryDone);
visitInstr(&MI);		updateSuccessors(MBB, true, PrimaryDone);
processUndefReads(MBB);
leaveBasicBlock(MBB);
}		}

// Visit all the loop blocks again in order to merge DomainValues from		// We need to go through again and finalize any blocks that are not done yet.
// back-edges.		// This is possible if blocks have dead predecessors, so we didn't visit them
for (MachineBasicBlock *MBB : Loops) {		// above. N.B.: The reason we update successors immediately above, rather than
enterBasicBlock(MBB);		// doing everything in one go here, is to avoid having to do two passes on
for (MachineInstr &MI : *MBB)		// basic blocks between loops (with the scheme above, the whole loop will be
if (!MI.isDebugValue())		// completed before moving on to the blocks after it).
processDefs(&MI, false);		for (ReversePostOrderTraversal<MachineBasicBlock *>::rpo_iterator
processUndefReads(MBB);		MBBI = RPOT.begin(),
leaveBasicBlock(MBB);		MBBE = RPOT.end();
		MBBI != MBBE; ++MBBI) {
		MachineBasicBlock *MBB = *MBBI;
		if (!isBlockDone(MBB)) {
		processBasicBlock(MBB, false, true);
		// Don't update successors here. We'll get to them anyway through this
		// loop.
		}
}		}

// Clear the LiveOuts vectors and collapse any remaining DomainValues.		// Clear the LiveOuts vectors and collapse any remaining DomainValues.
for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator		for (ReversePostOrderTraversal<MachineBasicBlock*>::rpo_iterator
MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {		MBBI = RPOT.begin(), MBBE = RPOT.end(); MBBI != MBBE; ++MBBI) {
LiveOutMap::const_iterator FI = LiveOuts.find(*MBBI);		auto FI = MBBInfos.find(*MBBI);
if (FI == LiveOuts.end() || !FI->second)		if (FI == MBBInfos.end() || !FI->second.OutRegs)
continue;		continue;
for (unsigned i = 0, e = NumRegs; i != e; ++i)		for (unsigned i = 0, e = NumRegs; i != e; ++i)
if (FI->second[i].Value)		if (FI->second.OutRegs[i].Value)
release(FI->second[i].Value);		release(FI->second.OutRegs[i].Value);
delete[] FI->second;		delete[] FI->second.OutRegs;
}		}
LiveOuts.clear();		MBBInfos.clear();
UndefReads.clear();		UndefReads.clear();
Avail.clear();		Avail.clear();
Allocator.DestroyAll();		Allocator.DestroyAll();

return false;		return false;
}		}

FunctionPass *		FunctionPass *
llvm::createExecutionDependencyFixPass(const TargetRegisterClass *RC) {		llvm::createExecutionDependencyFixPass(const TargetRegisterClass *RC) {
return new ExeDepsFix(RC);		return new ExeDepsFix(RC);
}		}
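
The comment on the final pass says some blocks can remain not done after the primary traversal because they have dead predecessors. A small sketch of why, under the same assumed done test as in the summary's criteria (the names `Counters`, `isDone`, and `needsFinalPass` are hypothetical and not part of the patch): a dead predecessor is never visited by the reverse post-order walk, so it never increments the block's processed counter, and that counter can never reach the predecessor count.

```cpp
#include <cassert>

// Hypothetical per-block counters, following the summary's description.
struct Counters {
  unsigned NumPreds = 0;
  bool PrimaryCompleted = false;
  unsigned IncomingProcessed = 0;
  unsigned PrimaryIncoming = 0;
  unsigned IncomingCompleted = 0;
};

// Assumed done test: all predecessors processed, and those known at primary
// time are done.
bool isDone(const Counters &C) {
  return C.PrimaryCompleted && C.IncomingCompleted == C.PrimaryIncoming &&
         C.IncomingProcessed == C.NumPreds;
}

// Models a block with NumPreds predecessors of which only ReachablePreds are
// visited by the traversal. For simplicity, each reachable predecessor is
// assumed to be done already in its primary pass (no loops involved).
bool needsFinalPass(unsigned NumPreds, unsigned ReachablePreds) {
  Counters X;
  X.NumPreds = NumPreds;
  for (unsigned I = 0; I != ReachablePreds; ++I) {
    ++X.IncomingProcessed;  // predecessor finished primary processing
    ++X.IncomingCompleted;  // ...and was done at that point
  }
  X.PrimaryCompleted = true;
  X.PrimaryIncoming = X.IncomingProcessed;
  return !isDone(X);
}
```

With one live and one dead predecessor, `needsFinalPass(2, 1)` is true, so only the trailing loop over the traversal finalizes the block; with all predecessors reachable, `needsFinalPass(1, 1)` is false and the block is already done after its primary pass.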

test/CodeGen/X86/break-false-dep.ll

	ret:			ret:
	ret i64 %s2			ret i64 %s2
	;AVX-LABEL:@loopclearence			;AVX-LABEL:@loopclearence
	;Registers 4-7 are not used and therefore one of them should be chosen			;Registers 4-7 are not used and therefore one of them should be chosen
	;AVX-NOT: {{%xmm[4-7]}}			;AVX-NOT: {{%xmm[4-7]}}
	;AVX: vcvtsi2sdq {{.*}}, [[XMM4_7:%xmm[4-7]]], {{%xmm[0-9]+}}			;AVX: vcvtsi2sdq {{.*}}, [[XMM4_7:%xmm[4-7]]], {{%xmm[0-9]+}}
	;AVX-NOT: [[XMM4_7]]			;AVX-NOT: [[XMM4_7]]
	}			}

				; Make sure we are making a smart choice regarding undef registers even for more
				; complicated loop structures. This example is the inner loop from
				; julia> a = falses(10000); a[1:4:end] = true
				; julia> linspace(1.0,2.0,10000)[a]
				define void @loopclearance2(double* nocapture %y, i64* %x, double %c1, double %c2, double %c3, double %c4, i64 %size) {
				entry:
				tail call void asm sideeffect "", "~{xmm7},~{dirflag},~{fpsr},~{flags}"()
				tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()
				tail call void asm sideeffect "", "~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{dirflag},~{fpsr},~{flags}"()
				br label %loop

				loop:
				%phi_i = phi i64 [ 1, %entry ], [ %nexti, %loop_end ]
				%phi_j = phi i64 [ 1, %entry ], [ %nextj, %loop_end ]
				%phi_k = phi i64 [ 0, %entry ], [ %nextk, %loop_end ]
				br label %inner_loop

				inner_loop:
				%phi = phi i64 [ %phi_k, %loop ], [ %nextk, %inner_loop ]
				%idx = lshr i64 %phi, 6
				%inputptr = getelementptr i64, i64* %x, i64 %idx
				%input = load i64, i64* %inputptr, align 8
				%masked = and i64 %phi, 63
				%shiftedmasked = shl i64 1, %masked
				%maskedinput = and i64 %input, %shiftedmasked
				%cmp = icmp eq i64 %maskedinput, 0
				%nextk = add i64 %phi, 1
				br i1 %cmp, label %inner_loop, label %loop_end

				loop_end:
				%nexti = add i64 %phi_i, 1
				%nextj = add i64 %phi_j, 1
				; Register use, plus us clobbering 7-15 above, basically forces xmm7 here as
				myatsina: xmm7 --> xmm6 ?
				; the only reasonable choice. The primary thing we care about is that it's
				; not one of the registers used in the loop (e.g. not the output reg here)
				;AVX-NOT: %xmm6
				;AVX: vcvtsi2sdq {{.*}}, %xmm6, {{%xmm[0-9]+}}
				;AVX-NOT: %xmm6
				%nexti_f = sitofp i64 %nexti to double
				%sub = fsub double %c1, %nexti_f
				%mul = fmul double %sub, %c2
				;AVX: vcvtsi2sdq {{.*}}, %xmm6, {{%xmm[0-9]+}}
				;AVX-NOT: %xmm6
				%phi_f = sitofp i64 %phi to double
				%mul2 = fmul double %phi_f, %c3
				%add2 = fadd double %mul, %mul2
				%div = fdiv double %add2, %c4
				%prev_j = add i64 %phi_j, -1
				%outptr = getelementptr double, double* %y, i64 %prev_j
				store double %div, double* %outptr, align 8
				%done = icmp slt i64 %size, %nexti
				br i1 %done, label %loopdone, label %loop

				loopdone:
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[ExecutionDepsFix] Improve clearance calculation for loops
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 84518

lib/CodeGen/ExecutionDepsFix.cpp

test/CodeGen/X86/break-false-dep.ll
