# [Pipeliner] Fixed node order issue related to zero latency edgesClosedPublicActions

Authored by jwroorda on Feb 22 2018, 7:03 AM.

# Details

Summary

A desired property of the node order in Swing Modulo Scheduling is
that for nodes outside circuits the following holds: none of them is
scheduled after both a successor and a predecessor. We call
node orders that meet this property valid.

Although invalid node orders do not lead to the generation of incorrect
code, they can cause the pipeliner not being able to find a pipelined schedule
for arbitrary II. The reason is that after scheduling the successor and the
predecessor of a node, no room may be left to schedule the node itself.

For data flow graphs with 0-latency edges, the node ordering algorithm
of Swing Modulo Scheduling can generate such undesired invalid node orders.
This patch fixes that.

In the remainder of this commit message, I will give an example
demonstrating the issue, explain the fix, and explain how the the fix is tested.

Consider, as an example, the following data flow graph with all
edge latencies 0 and all edges pointing downward.

```   n0
/  \
n1    n3
\  /
n2
|
n4```

Consider the implemented node order algorithm in top-down mode. In that mode,
the algorithm orders the nodes based on greatest Height and in case of equal
Height on lowest Movability. Finally, in case of equal Height and
Movability, given two nodes with an edge between them, the algorithm prefers
the source-node.

In the graph, for every node, the Height and Movability are equal to 0.
As will be explained below, the algorithm can generate the order n0, n1, n2, n3, n4.
So, node n3 is scheduled after its predecessor n0 and after its successor n2.

The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3.

To solve this, this patch introduces the ZeroLatencyHeight (ZLH) property
for nodes. It is defined as the maximum unweighted length of a path from the
given node to an arbitrary node in which each edge has latency 0.
So, ZLH(n0)=3, ZLH(n1)=ZLH(n3)=2, ZLH(n2)=1, and ZLH(n4)=0

In this patch, the preference for a greater ZeroLatencyHeight
is added in the top-down mode of the node ordering algorithm, after the
preference for a greater Height, and before the preference for a
lower Movability.

Therefore, the two allowed node-orders are n0, n1, n3, n2, n4 and n0, n3, n1, n2, n4.
Both of them are valid node orders.

In the same way, the bottom-up mode of the node ordering algorithm is adapted
by introducing the ZeroLatencyDepth property for nodes.

The patch is tested by adding extra checks to the following existing
lit-tests:
test/CodeGen/Hexagon/SUnit-boundary-prob.ll
test/CodeGen/Hexagon/frame-offset-overflow.ll
test/CodeGen/Hexagon/vect/vect-shuffle.ll

Before this patch, the pipeliner failed to pipeline the loops in these tests
due to invalid node-orders. After the patch, the pipeliner successfully
pipelines all these loops.

# Diff Detail

Repository
rL LLVM

### Event Timeline

jwroorda created this revision.Feb 22 2018, 7:03 AM
Herald added a subscriber: mgrang. Feb 22 2018, 7:03 AM
jwroorda edited the summary of this revision. (Show Details)Feb 22 2018, 7:05 AM
jwroorda edited the summary of this revision. (Show Details)Feb 23 2018, 3:38 AM
jwroorda edited the summary of this revision. (Show Details)Feb 23 2018, 3:42 AM
Ayal added a subscriber: Ayal.Feb 25 2018, 6:25 AM

Thanks for adding a patch to the pipeliner! I think what the patch is attempting is a good thing to add to the heuristic. Though I would have expected that the check for hasDataDependence would generate the correct node order for the example provided in the comment? Maybe only if hasDataDependence is extended to handle other types of dependences? But, this patch is probably cheaper since it pre-computes the information. I like the addition of the isValidNode as well.

I'm still evaluating this patch on some of our internal tests, so I may have some additional comments. I'll reply in the next day or so.

Thanks,
Brendon

lib/CodeGen/MachinePipeliner.cpp
2084 ↗(On Diff #135413)

I think this should be getZeroLatencyDepth(maxHeight)

I would have expected that the check for hasDataDependence would generate the correct node order for the example provided in the comment?

I see your point. I agree that intuitively/at first sight, one might expect that the check for hasDataDependence would guarantee a correct node order.
However, I don't think this is the case. I have tried to illustrate this in the example. That is, I have tried to explain why the algorithm
(using the hasDataDependence check) can arrive at a node-order in which node n3 is scheduled after its predecessor n0 and after its successor n2.

The crucial point in the example is that hasDataDependence is never called to check for a dependence between node n2 and n3.
Instead, it only checks for dependencies between nodes n4 and n3 and between nodes n4 and n2.

I don’t think that extending the hasDataDependence-check to other types of dependencies would help.
For the sake of argument, you can consider all dependencies in the example to be data-dependencies.

I have tried to explain the issue in more detail in the commit-message, see below.

Can you please check if you agree with the example in the commit-message?
If there is anything unclear, or if you believe that there is a mistake somewhere, please let me know.

Below follows the relevant part of the commit message:
"The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3."

lib/CodeGen/MachinePipeliner.cpp
2084 ↗(On Diff #135413)

Can you explain why? I think that in, in top-down mode, nodes with greater height should be scheduled first.

lib/CodeGen/MachinePipeliner.cpp
2085 ↗(On Diff #135413)

sorry, I add that comment to the wrong line. See below.

2130 ↗(On Diff #135413)

this is the place where I meant to add the change -
getZeroLatencyHeight(maxDepth) with getZeroLatencyDepth(maxDepth)

Thanks for the detailed explanation. That make sense, and I agree that the ZLD/ZLH is a more general/better solution.

lib/CodeGen/MachinePipeliner.cpp
1615 ↗(On Diff #135413)

I think the zeroLatencyDepth (and Height) calculation needs to be done prior to the ignoreDependence() check. Otherwise, Anti dependence edges won't be considered in the ZLD/ZLH calculation.

jwroorda updated this revision to Diff 136511.Mar 1 2018, 6:44 AM

I have made the suggested changes.

jwroorda marked 2 inline comments as done.Mar 1 2018, 6:51 AM

Thanks for your valuable feedback. I have made the changes that you have suggested.

lib/CodeGen/MachinePipeliner.cpp
1615 ↗(On Diff #135413)

I agree. This is fixed now.

2130 ↗(On Diff #135413)

Well spotted. This is fixed now.

bcahoon accepted this revision.Mar 1 2018, 8:04 PM

Thanks again for the patch to the pipeliner. I think it looks good, so feel free to commit if you're able to after addressing the final comment.

Thanks,
Brendon

lib/CodeGen/MachinePipeliner.cpp
937 ↗(On Diff #135413)

Should this call be moved to the DEBUG statement below? If the compiler is built without asserts, then this will generate a warning/error due to an unused variable.

This revision is now accepted and ready to land.Mar 1 2018, 8:04 PM
jwroorda updated this revision to Diff 136805.Mar 2 2018, 11:02 AM
jwroorda marked 2 inline comments as done.
jwroorda set the repository for this revision to rL LLVM.
jwroorda removed rL LLVM as the repository for this revision.Mar 2 2018, 11:07 AM
jwroorda marked an inline comment as done.Mar 2 2018, 11:10 AM
lib/CodeGen/MachinePipeliner.cpp
937 ↗(On Diff #135413)

I see your point. However, if the call is moved inside the DEBUG statement, the statistics information is only updated when debug information is printed. Therefore, instead, I moved the printing of "Invalid node order found!" inside the checkValidNodeOrder function.

This revision was automatically updated to reflect the committed changes.
jwroorda marked an inline comment as done.

This change causes some issues. The following testcase crashes or runs until memory is exhausted.

```define void @f0(i32 %a0) #0 {
b0:
%v0 = ashr i32 %a0, 1
br label %b1

b1:                                               ; preds = %b1, %b0
%v1 = phi i64 [ %v7, %b1 ], [ 0, %b0 ]
%v2 = phi i64 [ %v6, %b1 ], [ undef, %b0 ]
%v3 = phi i64 [ %v8, %b1 ], [ undef, %b0 ]
%v4 = phi i32 [ %v9, %b1 ], [ 0, %b0 ]
%v5 = tail call i64 @llvm.hexagon.S2.shuffeh(i64 undef, i64 %v2)
%v6 = tail call i64 @llvm.hexagon.A2.combinew(i32 undef, i32 undef)
%v7 = tail call i64 @llvm.hexagon.M2.vdmacs.s0(i64 %v1, i64 %v3, i64 %v5)
%v8 = tail call i64 @llvm.hexagon.A2.combinew(i32 undef, i32 undef)
%v9 = add nsw i32 %v4, 1
%v10 = icmp eq i32 %v9, %v0
br i1 %v10, label %b2, label %b1

b2:                                               ; preds = %b1
%v11 = trunc i64 %v7 to i32
%v12 = bitcast i8* undef to i32*
store i32 %v11, i32* %v12, align 4, !tbaa !0
call void @llvm.trap()
unreachable
}

declare i64 @llvm.hexagon.A2.combinew(i32, i32) #1
declare i64 @llvm.hexagon.M2.vdmacs.s0(i64, i64, i64) #1
declare i64 @llvm.hexagon.S2.shuffeh(i64, i64) #1
declare void @llvm.trap() #2

attributes #0 = { nounwind "target-cpu"="hexagonv60" "target-features"="+hvx,+hvx-length64b" }
attributes #1 = { nounwind readnone }
attributes #2 = { noreturn nounwind }

!0 = !{!1, !1, i64 0}
!1 = !{!"int", !2}
!2 = !{!"omnipotent char", !3}
!3 = !{!"Simple C/C++ TBAA"}```

Run with llc -march=hexagon.

This change causes some issues. The following testcase crashes or runs until memory is exhausted.

Thanks for the feedback. I have been able to reproduce the issue.

However, I do not think the change necessarily *causes* the issue. Instead, I believe that it exposes an already existing issue.
The node order generated by the change is valid. I believe that the rest of the SWP-algorithm should be able to handle it.

I did some debugging:
For the given example, after the node order is generated, the pipeliner is able to find a schedule with II=1.
However, when the "orderDependence" function is called from inside the "finalizeSchedule" function,
it gets caught in an infinite recursion. This warrants further investigation.

I did some debugging:
For the given example, after the node order is generated, the pipeliner is able to find a schedule with II=1.
However, when the "orderDependence" function is called from inside the "finalizeSchedule" function,
it gets caught in an infinite recursion. This warrants further investigation.

Thanks for the information. I can take a look at that. I've fixed a couple of issues with similar symptoms in that function. I agree that it's probably not your patch that is the problem...