This is an archive of the discontinued LLVM Phabricator instance.


Differential D43620

[Pipeliner] Fixed node order issue related to zero latency edges
Closed (Public)

Authored by jwroorda on Feb 22 2018, 7:03 AM.


Details

Reviewers

bcahoon

Commits

rG4b8bcf007b05: [Pipeliner] Fixed node order issue related to zero latency edges
rL326925: [Pipeliner] Fixed node order issue related to zero latency edges

Summary

A desired property of the node order in Swing Modulo Scheduling is
that for nodes outside circuits the following holds: none of them is
scheduled after both a successor and a predecessor. We call
node orders that meet this property valid.
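The property can be checked mechanically. Below is a minimal C++ sketch on a hypothetical adjacency-list graph. Graph, isValidNodeOrder and the InCircuit set are illustrative names, not LLVM APIs. It follows the same idea as the checkValidNodeOrder function this patch adds, but leaves out that function's PHI handling.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy graph (not LLVM's SUnit/SDep): Succs[n] lists the direct
// successors of node n. Nodes are numbered 0..N-1.
using Graph = std::vector<std::vector<int>>;

// Returns true if no node outside a circuit appears in Order after both one
// of its predecessors and one of its successors. Nodes listed in InCircuit
// are exempt from the check.
bool isValidNodeOrder(const Graph &Succs, const std::vector<int> &Order,
                      const std::set<int> &InCircuit) {
  const std::size_t N = Succs.size();
  assert(Order.size() == N && "Order must list every node exactly once");

  // Position of each node in the proposed order.
  std::vector<std::size_t> Pos(N);
  for (std::size_t I = 0; I < N; ++I)
    Pos[Order[I]] = I;

  // Derive predecessor lists from the successor lists.
  Graph Preds(N);
  for (std::size_t U = 0; U < N; ++U)
    for (int V : Succs[U])
      Preds[V].push_back(static_cast<int>(U));

  for (std::size_t Node = 0; Node < N; ++Node) {
    if (InCircuit.count(static_cast<int>(Node)))
      continue;
    bool PredBefore = false;
    bool SuccBefore = false;
    for (int P : Preds[Node])
      if (Pos[P] < Pos[Node])
        PredBefore = true;
    for (int S : Succs[Node])
      if (Pos[S] < Pos[Node])
        SuccBefore = true;
    if (PredBefore && SuccBefore)
      return false; // Node is ordered after a predecessor and a successor.
  }
  return true;
}
```

With the example graph used later in this summary (n0 -> n1, n0 -> n3, n1 -> n2, n3 -> n2, n2 -> n4), the order n0, n1, n2, n3, n4 is rejected because n3 follows both n0 and n2, while n0, n1, n3, n2, n4 is accepted.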

Although invalid node orders do not lead to the generation of incorrect
code, they can prevent the pipeliner from finding a pipelined schedule
for arbitrary II. The reason is that after scheduling the successor and the
predecessor of a node, no room may be left to schedule the node itself.

For data flow graphs with 0-latency edges, the node ordering algorithm
of Swing Modulo Scheduling can generate such undesired invalid node orders.
This patch fixes that.

In the remainder of this commit message, I will give an example
demonstrating the issue, explain the fix, and explain how the fix is tested.

Consider, as an example, the following data flow graph with all
edge latencies 0 and all edges pointing downward.

   n0
  /  \
n1    n3
  \  /
   n2
    |
   n4

Consider the implemented node order algorithm in top-down mode. In that mode,
the algorithm orders the nodes based on greatest Height and in case of equal
Height on lowest Movability. Finally, in case of equal Height and
Movability, given two nodes with an edge between them, the algorithm prefers
the source node.

In the graph, for every node, the Height and Movability are equal to 0.
As will be explained below, the algorithm can generate the order n0, n1, n2, n3, n4.
So, node n3 is scheduled after its predecessor n0 and after its successor n2.

The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3.
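The scan described above can be reproduced with a small model of the pre-patch top-down comparison chain. This is a minimal sketch, assuming a hypothetical ToyNode that holds only Height, MOV and the successor list; hasEdge plays the role of hasDataDependence, and all edges are treated as data edges.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy node: only the keys the pre-patch top-down comparison used.
struct ToyNode {
  int Height;
  int MOV;
  std::vector<int> Succs; // direct successors (all treated as data edges)
};

// Analogue of hasDataDependence(): true if From has a direct edge to To.
static bool hasEdge(const std::vector<ToyNode> &G, int From, int To) {
  for (int S : G[From].Succs)
    if (S == To)
      return true;
  return false;
}

// Mirrors the pre-patch comparison chain: scan the candidates in R in order
// and keep the current best in Max.
int pickTopDownOld(const std::vector<ToyNode> &G, const std::vector<int> &R) {
  assert(!R.empty() && "need at least one candidate");
  int Max = -1;
  for (int I : R) {
    if (Max == -1 || G[I].Height > G[Max].Height)
      Max = I;
    else if (G[I].Height == G[Max].Height && G[I].MOV < G[Max].MOV &&
             !hasEdge(G, Max, I))
      Max = I;
    else if (hasEdge(G, I, Max))
      Max = I;
  }
  return Max;
}
```

Scanning R as n4, n3, n2 returns n2, exactly as described. Scanning n2, n3, n4 instead returns n3, which shows that the pairwise tie-breaks do not define a consistent ordering: the result depends on the iteration order of R.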

To solve this, this patch introduces the ZeroLatencyHeight (ZLH) property
for nodes. It is defined as the maximum unweighted length of a path from the
given node to an arbitrary node in which each edge has latency 0.
So, ZLH(n0)=3, ZLH(n1)=ZLH(n3)=2, ZLH(n2)=1, and ZLH(n4)=0.

In this patch, the preference for a greater ZeroLatencyHeight
is added in the top-down mode of the node ordering algorithm, after the
preference for a greater Height, and before the preference for a
lower Movability.

Therefore, the only two possible node orders are n0, n1, n3, n2, n4 and n0, n3, n1, n2, n4.
Both of them are valid node orders.
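With the patched tie-breaker, the same scan over n4, n3, n2 picks n3. A minimal sketch, assuming a hypothetical Keys struct that holds Height, ZeroLatencyHeight and MOV, filled with the ZLH values listed above (Height and MOV are 0 for every node, as in the example):

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-node keys used by the patched top-down comparison.
struct Keys {
  int Height;
  int ZLH; // ZeroLatencyHeight
  int MOV;
};

// Greatest Height first, then greatest ZeroLatencyHeight, then lowest MOV.
int pickTopDownNew(const std::vector<Keys> &K, const std::vector<int> &R) {
  assert(!R.empty() && "need at least one candidate");
  int Max = -1;
  for (int I : R) {
    if (Max == -1 || K[I].Height > K[Max].Height)
      Max = I;
    else if (K[I].Height == K[Max].Height && K[I].ZLH > K[Max].ZLH)
      Max = I;
    else if (K[I].Height == K[Max].Height && K[I].ZLH == K[Max].ZLH &&
             K[I].MOV < K[Max].MOV)
      Max = I;
  }
  return Max;
}
```

Because ZLH(u) >= ZLH(v) + 1 for every zero-latency edge u -> v, a candidate with equal Height can no longer win against a zero-latency predecessor that is also a candidate, so both scan orders now return n3.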

In the same way, the bottom-up mode of the node ordering algorithm is adapted
by introducing the ZeroLatencyDepth property for nodes.
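Both properties are longest-path lengths over the subgraph of latency-0 edges, so each can be computed in one pass over a topological order, in the same way computeNodeFunctions computes ASAP (forward pass) and ALAP (reverse pass). The sketch below assumes a hypothetical per-node Edge list and an explicit topological order; Edge, ZeroLatencyInfo and computeZeroLatency are illustrative names, not LLVM's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy edge: target node and edge latency.
struct Edge {
  int To;
  int Latency;
};

struct ZeroLatencyInfo {
  std::vector<int> Depth;  // ZeroLatencyDepth per node
  std::vector<int> Height; // ZeroLatencyHeight per node
};

// Longest unweighted path made only of latency-0 edges that ends at (Depth)
// or starts at (Height) each node. Topo must be a topological order.
ZeroLatencyInfo computeZeroLatency(const std::vector<std::vector<Edge>> &Succs,
                                   const std::vector<int> &Topo) {
  const std::size_t N = Succs.size();
  assert(Topo.size() == N && "Topo must list every node exactly once");
  ZeroLatencyInfo Info{std::vector<int>(N, 0), std::vector<int>(N, 0)};

  // Forward pass: by the time U is visited, all of its predecessors have
  // already pushed their values into Depth[U].
  for (int U : Topo)
    for (const Edge &E : Succs[U])
      if (E.Latency == 0)
        Info.Depth[E.To] = std::max(Info.Depth[E.To], Info.Depth[U] + 1);

  // Reverse pass: all successors of a node are final before it is visited.
  for (auto It = Topo.rbegin(); It != Topo.rend(); ++It)
    for (const Edge &E : Succs[*It])
      if (E.Latency == 0)
        Info.Height[*It] = std::max(Info.Height[*It], Info.Height[E.To] + 1);
  return Info;
}
```

For the example graph, with nodes indexed n0..n4 and visited in the topological order n0, n1, n3, n2, n4, this yields ZLH = 3, 2, 1, 2, 0 (matching the values given above) and ZLD = 0, 1, 2, 1, 3. Giving n2 -> n4 a latency of 1 drops ZLH(n0) to 2 and ZLD(n4) to 0.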

The patch is tested by adding extra checks to the following existing
lit-tests:
test/CodeGen/Hexagon/SUnit-boundary-prob.ll
test/CodeGen/Hexagon/frame-offset-overflow.ll
test/CodeGen/Hexagon/vect/vect-shuffle.ll

Before this patch, the pipeliner failed to pipeline the loops in these tests
due to invalid node-orders. After the patch, the pipeliner successfully
pipelines all these loops.

Diff Detail

Repository: rL LLVM

Event Timeline

jwroorda created this revision. (Feb 22 2018, 7:03 AM)

Herald added a subscriber: mgrang. (Feb 22 2018, 7:03 AM)

jwroorda edited the summary of this revision. (Feb 22 2018, 7:05 AM)

jwroorda edited the summary of this revision. (Feb 23 2018, 3:38 AM)

jwroorda edited the summary of this revision. (Feb 23 2018, 3:42 AM)

Ayal added a subscriber: Ayal. (Feb 25 2018, 6:25 AM)

Thanks for adding a patch to the pipeliner! I think what the patch is attempting is a good thing to add to the heuristic. Though I would have expected that the check for hasDataDependence would generate the correct node order for the example provided in the comment? Maybe only if hasDataDependence is extended to handle other types of dependences? But, this patch is probably cheaper since it pre-computes the information. I like the addition of the isValidNode as well.

I'm still evaluating this patch on some of our internal tests, so I may have some additional comments. I'll reply in the next day or so.

Thanks,
Brendon

lib/CodeGen/MachinePipeliner.cpp
Line 2084 (on Diff #135413): I think this should be getZeroLatencyDepth(maxHeight)

Thanks for your comments!

> I would have expected that the check for hasDataDependence would generate the correct node order for the example provided in the comment?

I see your point. I agree that intuitively/at first sight, one might expect that the check for hasDataDependence would guarantee a correct node order.
However, I don't think this is the case. I have tried to illustrate this in the example. That is, I have tried to explain why the algorithm
(using the hasDataDependence check) can arrive at a node-order in which node n3 is scheduled after its predecessor n0 and after its successor n2.

The crucial point in the example is that hasDataDependence is never called to check for a dependence between node n2 and n3.
Instead, it only checks for dependencies between nodes n4 and n3 and between nodes n4 and n2.

I don’t think that extending the hasDataDependence-check to other types of dependencies would help.
For the sake of argument, you can consider all dependencies in the example to be data-dependencies.

I have tried to explain the issue in more detail in the commit-message, see below.

Can you please check if you agree with the example in the commit-message?
If there is anything unclear, or if you believe that there is a mistake somewhere, please let me know.

Below follows the relevant part of the commit message:
"The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3."

lib/CodeGen/MachinePipeliner.cpp
Line 2084 (on Diff #135413): Can you explain why? I think that, in top-down mode, nodes with greater height should be scheduled first.

bcahoon added inline comments. (Feb 28 2018, 7:40 AM)

lib/CodeGen/MachinePipeliner.cpp
Line 2085 (on Diff #135413): Sorry, I added that comment to the wrong line. See below.
Line 2130 (on Diff #135413): This is the place where I meant to suggest the change: replace getZeroLatencyHeight(maxDepth) with getZeroLatencyDepth(maxDepth).

Thanks for the detailed explanation. That makes sense, and I agree that the ZLD/ZLH is a more general/better solution.

lib/CodeGen/MachinePipeliner.cpp
Line 1615 (on Diff #135413): I think the zeroLatencyDepth (and Height) calculation needs to be done prior to the ignoreDependence() check. Otherwise, anti-dependence edges won't be considered in the ZLD/ZLH calculation.
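The effect of the ordering can be seen on a three-node chain whose first edge is a zero-latency anti dependence. The sketch assumes, as a simplification, that the ignore check skips exactly the anti edges; Dep, Kind, isIgnored and zeroLatencyHeight are hypothetical names, not LLVM's.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical toy dependence; the kind only decides whether the
// ASAP/ALAP-style computation skips the edge.
enum class Kind { Data, Anti };

struct Dep {
  int Succ;
  int Latency;
  Kind K;
};

// Stand-in for ignoreDependence(), assuming it skips anti edges.
static bool isIgnored(const Dep &D) { return D.K == Kind::Anti; }

// Computes ZeroLatencyHeight over a topological order Topo. With
// UpdateBeforeIgnore == true the zero-latency update runs before the ignore
// check; with false, ignored edges never contribute.
std::vector<int> zeroLatencyHeight(const std::vector<std::vector<Dep>> &Succs,
                                   const std::vector<int> &Topo,
                                   bool UpdateBeforeIgnore) {
  assert(Topo.size() == Succs.size() && "Topo must list every node once");
  std::vector<int> ZLH(Succs.size(), 0);
  for (auto It = Topo.rbegin(); It != Topo.rend(); ++It) {
    for (const Dep &D : Succs[*It]) {
      if (UpdateBeforeIgnore && D.Latency == 0)
        ZLH[*It] = std::max(ZLH[*It], ZLH[D.Succ] + 1);
      if (isIgnored(D))
        continue; // An ALAP-style update for this edge would be skipped.
      if (!UpdateBeforeIgnore && D.Latency == 0)
        ZLH[*It] = std::max(ZLH[*It], ZLH[D.Succ] + 1);
    }
  }
  return ZLH;
}
```

For the chain n0 -(anti, 0)-> n1 -(data, 0)-> n2, updating before the check gives ZLH = 2, 1, 0, while skipping first gives 0, 1, 0, so n0 would lose its preference over n1.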

I have made the suggested changes.

Thanks for your valuable feedback. I have made the changes that you have suggested.

lib/CodeGen/MachinePipeliner.cpp
Line 1615 (on Diff #135413): I agree. This is fixed now.
Line 2130 (on Diff #135413): Well spotted. This is fixed now.

Thanks again for the patch to the pipeliner. I think it looks good, so feel free to commit if you're able to after addressing the final comment.

Thanks,
Brendon

lib/CodeGen/MachinePipeliner.cpp
Line 937 (on Diff #135413): Should this call be moved to the DEBUG statement below? If the compiler is built without asserts, then this will generate a warning/error due to an unused variable.

This revision is now accepted and ready to land. (Mar 1 2018, 8:04 PM)

jwroorda updated this revision to Diff 136805. (Mar 2 2018, 11:02 AM)

jwroorda marked 2 inline comments as done.

jwroorda set the repository for this revision to rL LLVM.

jwroorda removed rL LLVM as the repository for this revision. (Mar 2 2018, 11:07 AM)

jwroorda marked an inline comment as done. (Mar 2 2018, 11:10 AM)

jwroorda added inline comments.

lib/CodeGen/MachinePipeliner.cpp
Line 937 (on Diff #135413): I see your point. However, if the call is moved inside the DEBUG statement, the statistics information is only updated when debug information is printed. Therefore, instead, I moved the printing of "Invalid node order found!" inside the checkValidNodeOrder function.

Closed by commit rL326925: [Pipeliner] Fixed node order issue related to zero latency edges (authored by jwroorda). (Mar 7 2018, 10:58 AM)

This revision was automatically updated to reflect the committed changes.

jwroorda marked an inline comment as done.

This change causes some issues. The following testcase crashes or runs until memory is exhausted.

define void @f0(i32 %a0) #0 {
b0:
  %v0 = ashr i32 %a0, 1
  br label %b1

b1:                                               ; preds = %b1, %b0
  %v1 = phi i64 [ %v7, %b1 ], [ 0, %b0 ]
  %v2 = phi i64 [ %v6, %b1 ], [ undef, %b0 ]
  %v3 = phi i64 [ %v8, %b1 ], [ undef, %b0 ]
  %v4 = phi i32 [ %v9, %b1 ], [ 0, %b0 ]
  %v5 = tail call i64 @llvm.hexagon.S2.shuffeh(i64 undef, i64 %v2)
  %v6 = tail call i64 @llvm.hexagon.A2.combinew(i32 undef, i32 undef)
  %v7 = tail call i64 @llvm.hexagon.M2.vdmacs.s0(i64 %v1, i64 %v3, i64 %v5)
  %v8 = tail call i64 @llvm.hexagon.A2.combinew(i32 undef, i32 undef)
  %v9 = add nsw i32 %v4, 1
  %v10 = icmp eq i32 %v9, %v0
  br i1 %v10, label %b2, label %b1

b2:                                               ; preds = %b1
  %v11 = trunc i64 %v7 to i32
  %v12 = bitcast i8* undef to i32*
  store i32 %v11, i32* %v12, align 4, !tbaa !0
  call void @llvm.trap()
  unreachable
}

declare i64 @llvm.hexagon.A2.combinew(i32, i32) #1
declare i64 @llvm.hexagon.M2.vdmacs.s0(i64, i64, i64) #1
declare i64 @llvm.hexagon.S2.shuffeh(i64, i64) #1
declare void @llvm.trap() #2

attributes #0 = { nounwind "target-cpu"="hexagonv60" "target-features"="+hvx,+hvx-length64b" }
attributes #1 = { nounwind readnone }
attributes #2 = { noreturn nounwind }

!0 = !{!1, !1, i64 0}
!1 = !{!"int", !2}
!2 = !{!"omnipotent char", !3}
!3 = !{!"Simple C/C++ TBAA"}

Run with llc -march=hexagon.

> This change causes some issues. The following testcase crashes or runs until memory is exhausted.

Thanks for the feedback. I have been able to reproduce the issue.

However, I do not think the change necessarily *causes* the issue. Instead, I believe that it exposes an already existing issue.
The node order generated by the change is valid. I believe that the rest of the SWP-algorithm should be able to handle it.

I did some debugging:
For the given example, after the node order is generated, the pipeliner is able to find a schedule with II=1.
However, when the "orderDependence" function is called from inside the "finalizeSchedule" function,
it gets caught in an infinite recursion. This warrants further investigation.

> I did some debugging:
> For the given example, after the node order is generated, the pipeliner is able to find a schedule with II=1.
> However, when the "orderDependence" function is called from inside the "finalizeSchedule" function,
> it gets caught in an infinite recursion. This warrants further investigation.

Thanks for the information. I can take a look at that. I've fixed a couple of issues with similar symptoms in that function. I agree that it's probably not your patch that is the problem...

Sorry for commenting on an old merge PR, but I was looking through some asserts and stumbled on this code, which raised some questions.

llvm/trunk/lib/CodeGen/MachinePipeliner.cpp
Line 3908: Why is Indices initialized with NodeOrder.size() {nullptr, 0} pairs? Are they used anywhere?
Line 3951: Is this dereference always valid? Can SuccSU be the exit SU and not be part of the NodeOrder?

Herald added a project: Restricted Project. (Jun 18 2020, 7:31 AM)

danilaml added inline comments. (Jun 22 2020, 8:28 AM)

llvm/trunk/lib/CodeGen/MachinePipeliner.cpp
Line 3951: Never mind this one. I see that this issue has already been fixed upstream.

jwroorda marked an inline comment as done. (Jul 1 2020, 5:37 AM)

jwroorda added inline comments.

llvm/trunk/lib/CodeGen/MachinePipeliner.cpp
Line 3908: The initialization is not needed. It can be removed.
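The sized constructor std::vector<UnitIndex> Indices(NodeOrder.size(), ...) already creates NodeOrder.size() {nullptr, 0} elements, and the push_back loop then appends NodeOrder.size() more, so half of the entries are unused placeholders. A minimal sketch of the same sorted-index lookup that only reserves capacity, using int keys as a hypothetical stand-in for SUnit pointers (buildIndex, positionOf and byKey are illustrative names). The lookup also checks the iterator before dereferencing it, which is the concern raised about line 3951.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in: int keys play the role of the SUnit pointers.
using UnitIndex = std::pair<int, unsigned>;

static bool byKey(const UnitIndex &A, const UnitIndex &B) {
  return A.first < B.first;
}

// Builds (key, position) pairs sorted by key. reserve() only allocates, so
// unlike the sized constructor no placeholder elements are created.
std::vector<UnitIndex> buildIndex(const std::vector<int> &Order) {
  std::vector<UnitIndex> Indices;
  Indices.reserve(Order.size());
  for (unsigned I = 0, S = static_cast<unsigned>(Order.size()); I < S; ++I)
    Indices.push_back(std::make_pair(Order[I], I));
  std::sort(Indices.begin(), Indices.end(), byKey);
  return Indices;
}

// Binary search for Key; checks the result before dereferencing it.
unsigned positionOf(const std::vector<UnitIndex> &Indices, int Key) {
  auto It = std::lower_bound(Indices.begin(), Indices.end(),
                             std::make_pair(Key, 0u), byKey);
  assert(It != Indices.end() && It->first == Key && "Key is not in the order");
  return It->second;
}
```

For the order 42, 7, 19 the index holds exactly three entries, and positionOf returns 0, 1 and 2 for those keys.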

Revision Contents

Path                                                        Size
llvm/trunk/lib/CodeGen/MachinePipeliner.cpp                 161 lines
llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll      6 lines
llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll    11 lines
llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll                4 lines
llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll        6 lines

Diff 137435

llvm/trunk/lib/CodeGen/MachinePipeliner.cpp

@@ 119 unchanged lines omitted @@
 #include <vector>
 
 using namespace llvm;
 
 #define DEBUG_TYPE "pipeliner"
 
 STATISTIC(NumTrytoPipeline, "Number of loops that we attempt to pipeline");
 STATISTIC(NumPipelined, "Number of loops software pipelined");
+STATISTIC(NumNodeOrderIssues, "Number of node order issues found");
 
 /// A command line option to turn software pipelining on or off.
 static cl::opt<bool> EnableSWP("enable-pipeliner", cl::Hidden, cl::init(true),
                                cl::ZeroOrMore,
                                cl::desc("Enable Software Pipelining"));
 
 /// A command line option to enable SWP at -Os.
 static cl::opt<bool> EnableSWPOptSize("enable-pipeliner-opt-size",
@@ 100 unchanged lines omitted @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
 
   /// A toplogical ordering of the SUnits, which is needed for changing
   /// dependences and iterating over the SUnits.
   ScheduleDAGTopologicalSort Topo;
 
   struct NodeInfo {
     int ASAP = 0;
     int ALAP = 0;
+    int ZeroLatencyDepth = 0;
+    int ZeroLatencyHeight = 0;
 
     NodeInfo() = default;
   };
   /// Computed properties for each node in the graph.
   std::vector<NodeInfo> ScheduleInfo;
 
   enum OrderKind { BottomUp = 0, TopDown = 1 };
   /// Computed node ordering for scheduling.
@@ 63 unchanged lines omitted @@ public:
 
   /// The mobility function, which the number of slots in which
   /// an instruction may be scheduled.
   int getMOV(SUnit *Node) { return getALAP(Node) - getASAP(Node); }
 
   /// The depth, in the dependence graph, for a node.
   int getDepth(SUnit *Node) { return Node->getDepth(); }
 
+  /// The maximum unweighted length of a path from an arbitrary node to the
+  /// given node in which each edge has latency 0
+  int getZeroLatencyDepth(SUnit *Node) {
+    return ScheduleInfo[Node->NodeNum].ZeroLatencyDepth;
+  }
+
   /// The height, in the dependence graph, for a node.
   int getHeight(SUnit *Node) { return Node->getHeight(); }
 
+  /// The maximum unweighted length of a path from the given node to an
+  /// arbitrary node in which each edge has latency 0
+  int getZeroLatencyHeight(SUnit *Node) {
+    return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
+  }
+
   /// Return true if the dependence is a back-edge in the data dependence graph.
   /// Since the DAG doesn't contain cycles, we represent a cycle in the graph
   /// using an anti dependence from a Phi to an instruction.
   bool isBackedge(SUnit *Source, const SDep &Dep) {
     if (Dep.getKind() != SDep::Anti)
       return false;
     return Source->getInstr()->isPHI() || Dep.getSUnit()->getInstr()->isPHI();
   }
@@ 65 unchanged lines omitted @@ private:
   void computeNodeFunctions(NodeSetType &NodeSets);
   void registerPressureFilter(NodeSetType &NodeSets);
   void colocateNodeSets(NodeSetType &NodeSets);
   void checkNodeSets(NodeSetType &NodeSets);
   void groupRemainingNodes(NodeSetType &NodeSets);
   void addConnectedNodes(SUnit *SU, NodeSet &NewSet,
                          SetVector<SUnit *> &NodesAdded);
   void computeNodeOrder(NodeSetType &NodeSets);
+  void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
   void generatePipelinedLoop(SMSchedule &Schedule);
   void generateProlog(SMSchedule &Schedule, unsigned LastStage,
                       MachineBasicBlock *KernelBB, ValueMapTy *VRMap,
                       MBBVectorTy &PrologBBs);
   void generateEpilog(SMSchedule &Schedule, unsigned LastStage,
                       MachineBasicBlock *KernelBB, ValueMapTy *VRMap,
                       MBBVectorTy &EpilogBBs, MBBVectorTy &PrologBBs);
@@ 443 unchanged lines omitted @@ void SwingSchedulerDAG::schedule() {
   changeDependences();
   DEBUG({
     for (unsigned su = 0, e = SUnits.size(); su != e; ++su)
       SUnits[su].dumpAll(this);
   });
 
   NodeSetType NodeSets;
   findCircuits(NodeSets);
+  NodeSetType Circuits = NodeSets;
 
   // Calculate the MII.
   unsigned ResMII = calculateResMII();
   unsigned RecMII = calculateRecMII(NodeSets);
 
   fuseRecs(NodeSets);
 
   // This flag is used for testing and can cause correctness problems.
@@ 37 unchanged lines omitted @@ DEBUG({
     for (auto &I : NodeSets) {
       dbgs() << " NodeSet ";
       I.dump();
     }
   });
 
   computeNodeOrder(NodeSets);
 
+  // check for node order issues
+  checkValidNodeOrder(Circuits);
+
   SMSchedule Schedule(Pass.MF);
   Scheduled = schedulePipeline(Schedule);
 
   if (!Scheduled)
     return;
 
   unsigned numStages = Schedule.getMaxStageCount();
   // No need to generate pipeline if there are no overlapped iterations.
@@ 636 unchanged lines omitted @@ for (ScheduleDAGTopologicalSort::const_iterator I = Topo.begin(),
                                                     E = Topo.end();
          I != E; ++I) {
       SUnit *SU = &SUnits[*I];
       SU->dump(this);
     }
   });
 
   int maxASAP = 0;
-  // Compute ASAP.
+  // Compute ASAP and ZeroLatencyDepth.
   for (ScheduleDAGTopologicalSort::const_iterator I = Topo.begin(),
                                                   E = Topo.end();
        I != E; ++I) {
     int asap = 0;
+    int zeroLatencyDepth = 0;
     SUnit *SU = &SUnits[*I];
     for (SUnit::const_pred_iterator IP = SU->Preds.begin(),
                                     EP = SU->Preds.end();
          IP != EP; ++IP) {
+      SUnit *pred = IP->getSUnit();
+      if (getLatency(SU, *IP) == 0)
+        zeroLatencyDepth =
+            std::max(zeroLatencyDepth, getZeroLatencyDepth(pred) + 1);
       if (ignoreDependence(*IP, true))
         continue;
-      SUnit *pred = IP->getSUnit();
       asap = std::max(asap, (int)(getASAP(pred) + getLatency(SU, *IP) -
                                   getDistance(pred, SU, *IP) * MII));
     }
     maxASAP = std::max(maxASAP, asap);
     ScheduleInfo[*I].ASAP = asap;
+    ScheduleInfo[*I].ZeroLatencyDepth = zeroLatencyDepth;
   }
 
-  // Compute ALAP and MOV.
+  // Compute ALAP, ZeroLatencyHeight, and MOV.
   for (ScheduleDAGTopologicalSort::const_reverse_iterator I = Topo.rbegin(),
                                                           E = Topo.rend();
        I != E; ++I) {
     int alap = maxASAP;
+    int zeroLatencyHeight = 0;
     SUnit *SU = &SUnits[*I];
     for (SUnit::const_succ_iterator IS = SU->Succs.begin(),
                                     ES = SU->Succs.end();
          IS != ES; ++IS) {
+      SUnit *succ = IS->getSUnit();
+      if (getLatency(SU, *IS) == 0)
+        zeroLatencyHeight =
+            std::max(zeroLatencyHeight, getZeroLatencyHeight(succ) + 1);
       if (ignoreDependence(*IS, true))
         continue;
-      SUnit *succ = IS->getSUnit();
       alap = std::min(alap, (int)(getALAP(succ) - getLatency(SU, *IS) +
                                   getDistance(SU, succ, *IS) * MII));
     }
 
     ScheduleInfo[*I].ALAP = alap;
+    ScheduleInfo[*I].ZeroLatencyHeight = zeroLatencyHeight;
   }
 
   // After computing the node functions, compute the summary for each node set.
   for (NodeSet &I : NodeSets)
     I.computeNodeSetInfo(this);
 
   DEBUG({
     for (unsigned i = 0; i < SUnits.size(); i++) {
       dbgs() << "\tNode " << i << ":\n";
       dbgs() << "\t ASAP = " << getASAP(&SUnits[i]) << "\n";
       dbgs() << "\t ALAP = " << getALAP(&SUnits[i]) << "\n";
       dbgs() << "\t MOV = " << getMOV(&SUnits[i]) << "\n";
       dbgs() << "\t D = " << getDepth(&SUnits[i]) << "\n";
       dbgs() << "\t H = " << getHeight(&SUnits[i]) << "\n";
+      dbgs() << "\t ZLD = " << getZeroLatencyDepth(&SUnits[i]) << "\n";
+      dbgs() << "\t ZLH = " << getZeroLatencyHeight(&SUnits[i]) << "\n";
     }
   });
 }
 
 /// Compute the Pred_L(O) set, as defined in the paper. The set is defined
 /// as the predecessors of the elements of NodeOrder that are not also in
 /// NodeOrder.
 static bool pred_L(SetVector<SUnit *> &NodeOrder,
@@ 352 unchanged lines omitted @@ for (NodeSetType::iterator J = I + 1; J != E;) {
         NodeSets.erase(J);
         E = NodeSets.end();
       } else {
         ++J;
       }
     }
 }
 
-/// Return true if Inst1 defines a value that is used in Inst2.
-static bool hasDataDependence(SUnit *Inst1, SUnit *Inst2) {
-  for (auto &SI : Inst1->Succs)
-    if (SI.getSUnit() == Inst2 && SI.getKind() == SDep::Data)
-      return true;
-  return false;
-}
-
 /// Compute an ordered list of the dependence graph nodes, which
 /// indicates the order that the nodes will be scheduled. This is a
 /// two-level algorithm. First, a partial order is created, which
 /// consists of a list of sets ordered from highest to lowest priority.
 void SwingSchedulerDAG::computeNodeOrder(NodeSetType &NodeSets) {
   SmallSetVector<SUnit *, 8> R;
   NodeOrder.clear();
 
@@ 30 unchanged lines omitted @@ if (pred_L(NodeOrder, N) && isSubset(N, Nodes)) {
       R.insert(maxASAP);
       Order = BottomUp;
       DEBUG(dbgs() << " Bottom up (default) ");
     }
 
     while (!R.empty()) {
       if (Order == TopDown) {
         // Choose the node with the maximum height. If more than one, choose
-        // the node with the lowest MOV. If still more than one, check if there
-        // is a dependence between the instructions.
+        // the node with the maximum ZeroLatencyHeight. If still more than one,
+        // choose the node with the lowest MOV.
         while (!R.empty()) {
           SUnit *maxHeight = nullptr;
           for (SUnit *I : R) {
             if (maxHeight == nullptr || getHeight(I) > getHeight(maxHeight))
               maxHeight = I;
             else if (getHeight(I) == getHeight(maxHeight) &&
-                     getMOV(I) < getMOV(maxHeight) &&
-                     !hasDataDependence(maxHeight, I))
+                     getZeroLatencyHeight(I) > getZeroLatencyHeight(maxHeight))
               maxHeight = I;
-            else if (hasDataDependence(I, maxHeight))
+            else if (getHeight(I) == getHeight(maxHeight) &&
+                     getZeroLatencyHeight(I) ==
+                         getZeroLatencyHeight(maxHeight) &&
+                     getMOV(I) < getMOV(maxHeight))
               maxHeight = I;
           }
           NodeOrder.insert(maxHeight);
           DEBUG(dbgs() << maxHeight->NodeNum << " ");
           R.remove(maxHeight);
           for (const auto &I : maxHeight->Succs) {
             if (Nodes.count(I.getSUnit()) == 0)
               continue;
@@ 16 unchanged lines omitted @@ while (!R.empty()) {
         }
         Order = BottomUp;
         DEBUG(dbgs() << "\n Switching order to bottom up ");
         SmallSetVector<SUnit *, 8> N;
         if (pred_L(NodeOrder, N, &Nodes))
           R.insert(N.begin(), N.end());
       } else {
         // Choose the node with the maximum depth. If more than one, choose
-        // the node with the lowest MOV. If there is still more than one, check
-        // for a dependence between the instructions.
+        // the node with the maximum ZeroLatencyDepth. If still more than one,
+        // choose the node with the lowest MOV.
         while (!R.empty()) {
           SUnit *maxDepth = nullptr;
           for (SUnit *I : R) {
             if (maxDepth == nullptr || getDepth(I) > getDepth(maxDepth))
               maxDepth = I;
             else if (getDepth(I) == getDepth(maxDepth) &&
-                     getMOV(I) < getMOV(maxDepth) &&
-                     !hasDataDependence(I, maxDepth))
+                     getZeroLatencyDepth(I) > getZeroLatencyDepth(maxDepth))
               maxDepth = I;
-            else if (hasDataDependence(maxDepth, I))
+            else if (getDepth(I) == getDepth(maxDepth) &&
+                     getZeroLatencyDepth(I) == getZeroLatencyDepth(maxDepth) &&
+                     getMOV(I) < getMOV(maxDepth))
               maxDepth = I;
           }
           NodeOrder.insert(maxDepth);
           DEBUG(dbgs() << maxDepth->NodeNum << " ");
           R.remove(maxDepth);
           if (Nodes.isExceedSU(maxDepth)) {
             Order = TopDown;
             R.clear();
▲ Show 20 Lines • Show All 1,752 Lines • ▼ Show 20 Lines	for (auto &SI : SU.Succs)
if (SI.isAssignedRegDep())		if (SI.isAssignedRegDep())
if (ST.getRegisterInfo()->isPhysicalRegister(SI.getReg()))		if (ST.getRegisterInfo()->isPhysicalRegister(SI.getReg()))
if (stageScheduled(SI.getSUnit()) != StageDef)		if (stageScheduled(SI.getSUnit()) != StageDef)
return false;		return false;
}		}
return true;		return true;
}		}

		/// A property of the node order in swing-modulo-scheduling is
		/// that for nodes outside circuits the following holds:
		/// none of them is scheduled after both a successor and a
		/// predecessor.
		/// The method below checks whether the property is met.
		/// If not, debug information is printed and statistics information updated.
		/// Note that we do not use an assert statement.
		/// The reason is that although an invalid node oder may prevent
		/// the pipeliner from finding a pipelined schedule for arbitrary II,
		/// it does not lead to the generation of incorrect code.
		void SwingSchedulerDAG::checkValidNodeOrder(const NodeSetType &Circuits) const {

		// a sorted vector that maps each SUnit to its index in the NodeOrder
		typedef std::pair<SUnit *, unsigned> UnitIndex;
		std::vector<UnitIndex> Indices(NodeOrder.size(), std::make_pair(nullptr, 0));
		danilaml (inline comment, not done): Why is Indices initialized with NodeOrder.size() {nullptr, 0} pairs? Are they used anywhere?
		jwroorda (author, done): The initialization is not needed. It can be removed.

		for (unsigned i = 0, s = NodeOrder.size(); i < s; ++i)
		Indices.push_back(std::make_pair(NodeOrder[i], i));

		auto CompareKey = [](UnitIndex i1, UnitIndex i2) {
		return std::get<0>(i1) < std::get<0>(i2);
		};

		// sort, so that we can perform a binary search
		std::sort(Indices.begin(), Indices.end(), CompareKey);

		bool Valid = true;
		// for each SUnit in the NodeOrder, check whether
		// it appears after both a successor and a predecessor
		// of the SUnit. If this is the case, and the SUnit
		// is not part of a circuit, then the NodeOrder is not
		// valid.
		for (unsigned i = 0, s = NodeOrder.size(); i < s; ++i) {
		SUnit *SU = NodeOrder[i];
		unsigned Index = i;

		bool PredBefore = false;
		bool SuccBefore = false;

		SUnit *Succ;
		SUnit *Pred;

		for (SDep &PredEdge : SU->Preds) {
		SUnit *PredSU = PredEdge.getSUnit();
		unsigned PredIndex =
		std::get<1>(*std::lower_bound(Indices.begin(), Indices.end(),
		std::make_pair(PredSU, 0), CompareKey));
		if (!PredSU->getInstr()->isPHI() && PredIndex < Index) {
		PredBefore = true;
		Pred = PredSU;
		break;
		}
		}

		for (SDep &SuccEdge : SU->Succs) {
		SUnit *SuccSU = SuccEdge.getSUnit();
		unsigned SuccIndex =
		std::get<1>(*std::lower_bound(Indices.begin(), Indices.end(),
		std::make_pair(SuccSU, 0), CompareKey));
		danilaml (inline comment, not done): Is this dereference always valid? Can SuccSU be the exit SU and not be part of the NodeOrder?
		danilaml (inline comment, not done): Never mind this. I see that this issue has already been fixed upstream.
		if (!SuccSU->getInstr()->isPHI() && SuccIndex < Index) {
		SuccBefore = true;
		Succ = SuccSU;
		break;
		}
		}

		if (PredBefore && SuccBefore && !SU->getInstr()->isPHI()) {
		// instructions in circuits are allowed to be scheduled
		// after both a successor and a predecessor.
		bool InCircuit = std::any_of(
		Circuits.begin(), Circuits.end(),
		[SU](const NodeSet &Circuit) { return Circuit.count(SU); });
		if (InCircuit)
		DEBUG(dbgs() << "In a circuit, predecessor ";);
		else {
		Valid = false;
		NumNodeOrderIssues++;
		DEBUG(dbgs() << "Predecessor ";);
		}
		DEBUG(dbgs() << Pred->NodeNum << " and successor " << Succ->NodeNum
		<< " are scheduled before node " << SU->NodeNum << "\n";);
		}
		}

		DEBUG({
		if (!Valid)
		dbgs() << "Invalid node order found!\n";
		});
		}
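The check performed by `checkValidNodeOrder` can be tried outside LLVM with a sketch like the one below. It assumes a hypothetical integer-indexed `Graph` instead of `SUnit`/`SDep`, and it omits the PHI and circuit exemptions. Unlike the hunk, it only reserves capacity for `Indices` (the point raised in the first inline comment), and it returns no position for a node that is missing from the order, such as a boundary node (the question in the second inline comment), instead of dereferencing the result of `std::lower_bound` unconditionally. `positionOf` and `invalidNodes` are illustrative names.

```cpp
#include <algorithm>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical minimal DAG: Succs[N] and Preds[N] list the neighbours of node N.
struct Graph {
  std::vector<std::vector<int>> Succs, Preds;
  explicit Graph(int NumNodes) : Succs(NumNodes), Preds(NumNodes) {}
  void addEdge(int From, int To) {
    Succs[From].push_back(To);
    Preds[To].push_back(From);
  }
};

using UnitIndex = std::pair<int, unsigned>; // (node, position in the order)

// Binary search in Indices (sorted by node). Returns std::nullopt when the
// node is not in the order instead of dereferencing an invalid iterator.
std::optional<unsigned> positionOf(const std::vector<UnitIndex> &Indices,
                                   int Node) {
  auto It = std::lower_bound(
      Indices.begin(), Indices.end(), UnitIndex(Node, 0),
      [](const UnitIndex &A, const UnitIndex &B) { return A.first < B.first; });
  if (It == Indices.end() || It->first != Node)
    return std::nullopt;
  return It->second;
}

// Returns the nodes of Order that appear after both a predecessor and a
// successor. PHIs and circuit members, which the patch exempts, are not
// modelled here.
std::vector<int> invalidNodes(const Graph &G, const std::vector<int> &Order) {
  std::vector<UnitIndex> Indices;
  Indices.reserve(Order.size()); // capacity only, no placeholder entries
  for (unsigned I = 0; I < Order.size(); ++I)
    Indices.emplace_back(Order[I], I);
  std::sort(Indices.begin(), Indices.end());

  // True if any node in Neighbours is placed before position Index.
  auto anyBefore = [&Indices](const std::vector<int> &Neighbours,
                              unsigned Index) {
    return std::any_of(Neighbours.begin(), Neighbours.end(), [&](int N) {
      std::optional<unsigned> Pos = positionOf(Indices, N);
      return Pos && *Pos < Index;
    });
  };

  std::vector<int> Invalid;
  for (unsigned I = 0; I < Order.size(); ++I) {
    int SU = Order[I];
    if (anyBefore(G.Preds[SU], I) && anyBefore(G.Succs[SU], I))
      Invalid.push_back(SU);
  }
  return Invalid;
}
```

On the diamond graph from the summary (n0 -> n1, n0 -> n3, n1 -> n2, n3 -> n2, n2 -> n4), the order n0 n1 n3 n2 n4 passes, while n0 n1 n4 n2 n3 is reported for n2 (placed after n1 and n4) and for n3 (placed after n0 and n2).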

/// Attempt to fix the degenerate cases when the instruction serialization		/// Attempt to fix the degenerate cases when the instruction serialization
/// causes the register lifetimes to overlap. For example,		/// causes the register lifetimes to overlap. For example,
/// p' = store_pi(p, b)		/// p' = store_pi(p, b)
/// = load p, offset		/// = load p, offset
/// In this case p and p' overlap, which means that two registers are needed.		/// In this case p and p' overlap, which means that two registers are needed.
/// Instead, this function changes the load to use p' and updates the offset.		/// Instead, this function changes the load to use p' and updates the offset.
void SwingSchedulerDAG::fixupRegisterOverlaps(std::deque<SUnit *> &Instrs) {		void SwingSchedulerDAG::fixupRegisterOverlaps(std::deque<SUnit *> &Instrs) {
unsigned OverlapReg = 0;		unsigned OverlapReg = 0;

llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll

	; RUN: llc -march=hexagon -O2 -mcpu=hexagonv60 < %s | FileCheck %s			; REQUIRES: asserts
				; RUN: llc -march=hexagon -O2 -mcpu=hexagonv60 --stats -o - 2>&1 < %s | FileCheck %s
	; This was aborting while processing SUnits.			; This was aborting while processing SUnits.

	; CHECK: vmem			; CHECK: vmem

				; CHECK-NOT: Number of node order issues found
				; CHECK: Number of loops software pipelined
				; CHECK-NOT: Number of node order issues found
	source_filename = "bugpoint-output-bdb0052.bc"			source_filename = "bugpoint-output-bdb0052.bc"
	target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"			target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"
	target triple = "hexagon-unknown--elf"			target triple = "hexagon-unknown--elf"

	; Function Attrs: nounwind readnone			; Function Attrs: nounwind readnone
	declare <16 x i32> @llvm.hexagon.V6.lo(<32 x i32>) #0			declare <16 x i32> @llvm.hexagon.V6.lo(<32 x i32>) #0

	; Function Attrs: nounwind readnone			; Function Attrs: nounwind readnone

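The updated tests run `llc` with `--stats` and redirect stderr into FileCheck (`2>&1`), which is why they now carry `REQUIRES: asserts`: statistics are normally only collected in assertion-enabled builds. The `CHECK-NOT` / `CHECK` / `CHECK-NOT` triple rejects the node-order counter on either side of the pipelining counter, and the middle `CHECK` also requires that at least one loop was pipelined. A rough C++ model of that condition is sketched below; `statsLookClean` is a hypothetical name, and the model ignores how FileCheck anchors on earlier matches such as `CHECK: vmem`.

```cpp
#include <string>

// Rough model of the three directives: the pipelining counter must be
// present, and the node-order counter must not appear before or after it.
bool statsLookClean(const std::string &Output) {
  const std::string Pipelined = "Number of loops software pipelined";
  const std::string Issues = "Number of node order issues found";
  std::string::size_type Pos = Output.find(Pipelined);
  if (Pos == std::string::npos)
    return false; // the CHECK line would fail
  bool IssueBefore = Output.substr(0, Pos).find(Issues) != std::string::npos;
  bool IssueAfter =
      Output.find(Issues, Pos + Pipelined.size()) != std::string::npos;
  return !IssueBefore && !IssueAfter; // both CHECK-NOT lines pass
}
```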
llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll

	; RUN: llc -march=hexagon < %s | FileCheck %s			; REQUIRES: asserts
				; RUN: llc -march=hexagon --stats -o - 2>&1 < %s | FileCheck %s

	; In reality, check that the compilation succeeded and that some code was			; Check that the compilation succeeded and that some code was generated.
	; generated.
	; CHECK: vadd			; CHECK: vadd

				; Check that the loop is pipelined and that a valid node order is used.
				; CHECK-NOT: Number of node order issues found
				; CHECK: Number of loops software pipelined
				; CHECK-NOT: Number of node order issues found

	target triple = "hexagon"			target triple = "hexagon"

	define void @fred(i16* noalias nocapture readonly %p0, i32 %p1, i32 %p2, i16* noalias nocapture %p3, i32 %p4) local_unnamed_addr #1 {			define void @fred(i16* noalias nocapture readonly %p0, i32 %p1, i32 %p2, i16* noalias nocapture %p3, i32 %p4) local_unnamed_addr #1 {
	entry:			entry:
	%mul = mul i32 %p4, %p1			%mul = mul i32 %p4, %p1
	%add.ptr = getelementptr inbounds i16, i16* %p0, i32 %mul			%add.ptr = getelementptr inbounds i16, i16* %p0, i32 %mul
	%add = add nsw i32 %p4, 1			%add = add nsw i32 %p4, 1
	%rem = srem i32 %add, 5			%rem = srem i32 %add, 5

llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll

	; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-pipeliner < %s | FileCheck %s			; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-pipeliner < %s | FileCheck %s
	; RUN: llc -march=hexagon -mcpu=hexagonv5 -O3 < %s | FileCheck %s			; RUN: llc -march=hexagon -mcpu=hexagonv5 -O3 < %s | FileCheck %s

	; Multiply and accumulate			; Multiply and accumulate
	; CHECK: mpyi([[REG0:r([0-9]+)]],[[REG1:r([0-9]+)]])			; CHECK: mpyi([[REG0:r([0-9]+)]],[[REG1:r([0-9]+)]])
	; CHECK-NEXT: add(r{{[0-9]+}},#4)			; CHECK-NEXT: add(r{{[0-9]+}},#4)
	; CHECK-NEXT: [[REG0]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)			; CHECK-DAG: [[REG1]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)
	; CHECK-NEXT: [[REG1]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)			; CHECK-DAG: [[REG0]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)
	; CHECK-NEXT: endloop0			; CHECK-NEXT: endloop0

	define i32 @foo(i32* %a, i32* %b, i32 %n) {			define i32 @foo(i32* %a, i32* %b, i32 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%sum.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]			%sum.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]

llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll

	; RUN: llc -march=hexagon -mcpu=hexagonv5 -disable-hsdr < %s | FileCheck %s			; REQUIRES: asserts
				; RUN: llc -march=hexagon -mcpu=hexagonv5 -disable-hsdr --stats -o - 2>&1 < %s | FileCheck %s

	; Check that store is post-incremented.			; Check that store is post-incremented.
	; CHECK-NOT: extractu(r{{[0-9]+}},#32,			; CHECK-NOT: extractu(r{{[0-9]+}},#32,
	; CHECK-NOT: insert			; CHECK-NOT: insert
				; CHECK-NOT: Number of node order issues found
				; CHECK: Number of loops software pipelined
				; CHECK-NOT: Number of node order issues found
	target datalayout = "e-p:32:32:32-i64:64:64-i32:32:32-i16:16:16-i1:32:32-f64:64:64-f32:32:32-v64:64:64-v32:32:32-a0:0-n16:32"			target datalayout = "e-p:32:32:32-i64:64:64-i32:32:32-i16:16:16-i1:32:32-f64:64:64-f32:32:32-v64:64:64-v32:32:32-a0:0-n16:32"
	target triple = "hexagon"			target triple = "hexagon"

	define i32 @foo(i16* noalias nocapture %src, i16* noalias nocapture %dstImg, i32 %width, i32 %idx, i32 %flush) #0 {			define i32 @foo(i16* noalias nocapture %src, i16* noalias nocapture %dstImg, i32 %width, i32 %idx, i32 %flush) #0 {
	entry:			entry:
	%0 = tail call i64 @llvm.hexagon.A2.combinew(i32 %flush, i32 %flush)			%0 = tail call i64 @llvm.hexagon.A2.combinew(i32 %flush, i32 %flush)
	%1 = bitcast i64 %0 to <2 x i32>			%1 = bitcast i64 %0 to <2 x i32>
	br label %polly.loop_body			br label %polly.loop_body