This is an archive of the discontinued LLVM Phabricator instance.

[MachinePipeliner] Consider only direct path successors when calculating circuit latency
AcceptedPublic

Authored by JamesNagurne on Jul 5 2022, 2:18 PM.

Details

Summary

https://discourse.llvm.org/t/incorrect-recmii-calculations-in-machinepipeliner/63531

The pipeliner's calculation of RecMII depends on the 'Latency' calculated
from each elementary circuit.

Consider the follow example: a DAG with a triange dependence of the form:

1 -> 2, 1 -> 3, 2 -> 3, and 3 -> 1

Assume edges are all weighted 1.

The current calculation for NodeSet Latency checks only that the successor
is in the circuit, and not that the successor is part of the path taken by
the particular elementary circuit.

This presents as the following with the given DAG:

  • Circuit 1 -> 3 correctly calculated as Latency = 1
  • Circuit 1 -> 2 -> 3 incorrectly calculated as Latency = 3 and not 2

The solution here notes that Johnson's algorithm returns the elementary
circuit in the order that the path is taken. This fact is used to ensure
that only edges to the immediate successor in the circuit is checked for
latency.

I've added a common-thread reviewer for a few past commits, but welcome others to add and review themselves.
The subscribers I've added are colleagues as well as past MachinePipeliner committers as interested parties.

As an aside: Are discourse links static and persistent? Or should I remove it from the log here and only reference it in the review?

Diff Detail

Event Timeline

JamesNagurne created this revision.Jul 5 2022, 2:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2022, 2:18 PM
JamesNagurne requested review of this revision.Jul 5 2022, 2:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2022, 2:18 PM
JamesNagurne edited reviewers, added: dmgreen, hgreving; removed: jmolloy.Jul 12 2022, 8:38 AM
JamesNagurne added a subscriber: jmolloy.

Changed reviewers up since I've seen James is seldom active. I've added David and Hendrik as they are the last 2 to have functionally changed things related to the pipeliner.

I am on discord, if there is a desire for more online communication.

1 -> 2, 1 -> 3, 2 -> 3, and 3 -> 1
Assume edges are all weighted 1.
Circuit 1 -> 3 correctly calculated as Latency = 1

1 -> 3 -> 1 should be recurrence length 2, or is the backedge latency 0?

1 -> 2, 1 -> 3, 2 -> 3, and 3 -> 1
Assume edges are all weighted 1.
Circuit 1 -> 3 correctly calculated as Latency = 1

1 -> 3 -> 1 should be recurrence length 2, or is the backedge latency 0?

My understanding was that the upstream doesn't properly figure out latencies of backedges in some cases. It'll return getInstrLatency, which is wrong in our downstream implementation.
If it does skip the PHI and instead calculate the latency along the true back edge, then my mistake. I thought that was something I had to implement downstream as part of a DAG mutation.

Regardless, the RecMII error still applies. The compiler will calculate 1 -> 3 -> 1 as 2, and 1 -> 2 -> 3 -> 1 as 4

Are you working on a backend where writing a test for this is feasible, like Hexagon?

I went through your change and your discourse post and to me this is sound. We're using our own downstream implementation, I am not very familiar with the upstream code. Hoping to find one more reviewer for this.

llvm/include/llvm/CodeGen/MachinePipeliner.h
351

This relies on "notes that Johnson's algorithm returns the elementary circuit in the order that the path is taken". Could you add this as a comment here, maybe even add a note at line 321 to SetVector Node, that Node needs to be ordered (and the SetVector isn't there just to avoid non-determinism).

Can you also add a pointer to the fact that it does reutrn the path in order (I myself for example didn't know this).

Are you working on a backend where writing a test for this is feasible, like Hexagon?

I went through your change and your discourse post and to me this is sound. We're using our own downstream implementation, I am not very familiar with the upstream code. Hoping to find one more reviewer for this.

That's always the problem, isn't it! Our downstream target is not public, and I'm not familiar enough with the current users' ISAs and implementation to create a targeted case that would exhibit this problem/solution.

  • This is a small, albeit important part of software pipelining, and so pushing a DAG through construction, postprocess, mutation, scheduling, and expansion before anything can be tested in lit sounds difficult
  • Being a Machine pass, it'd be nice to have a test for everyone, but this change really is generic. I don't have the expertise to go write for upstream Hexagon, PowerPC, and ARM.

So my thought process leads to a couple possibilities:

  • Is there a way to unit-test such things? It would be possible, albeit annoying to include this module and then synthesize the NodeSet to ensure the ii is expected.
  • Is there any prior art to using some sort of debug/opt remark output to test things like this? We could emit Res/Rec calculations as part of the pass, and use that to make testing less dependent on code generation

As far as review, I had hoped @dmgreen would additionally check off on it as the team at ARM have been making changes and enabling the pass for some subtargets.

Circuit 1 -> 2 -> 3 incorrectly calculated as Latency = 3 and not 2

Why should it be 2? There are three edges on the critical path (1 -> 2, 2 -> 3 and 3 -> 1), so the latency is 3.

By the way, there are other issues with the current implementation. From the top of my head:

  • getDistance does not look through PHIs, hence the value returned never exceeds 1. This may needlessly increase node functions (ASAP, ALAP etc.).
  • canReserveResources and reserveResources do not align in the way they work with ProcResourceCount. This may cause a compiler crash.
  • Calculation of latency between two instructions connected through a PHI does not involve scheduling info; the computed latency is always one or zero (don't remember which one exactly).
  • Finally, the schedule the pipeliner carefully crafted is destroyed by the subsequent MI scheduler, which makes it less useful for VLIW in-order targets.

There is a room for improvement, but thanks-to-out-of-order-execution few people care.

A workaround for your issue is simple - you just don't call calculateRecMII and estimate MII yourself. You may as well force it to 1 and it will be safe because MII is just a hint for the scheduler. This will affect compilation time, of course.

Excuse my English

Circuit 1 -> 2 -> 3 incorrectly calculated as Latency = 3 and not 2

Why should it be 2? There are three edges on the critical path (1 -> 2, 2 -> 3 and 3 -> 1), so the latency is 3.

As I said in the previous comment, I could be wrong, but I understood the algorithm as not properly calculating the latency of the edge from node 3 to the PHI at the top of the loop, so I did not add that value. Regardless, if that edge is considered, and is given a value of 1, the calculation in the currently upstreamed code will give a latency of 4, not 3, due to the latency of edge 1 -> 3 being erroneously considered.

llvm/include/llvm/CodeGen/MachinePipeliner.h
351

I'll do

By the way, there are other issues with the current implementation. From the top of my head:

  • getDistance does not look through PHIs, hence the value returned never exceeds 1. This may needlessly increase node functions (ASAP, ALAP etc.).

Interesting point. I wonder if we've made changes here on our end... Not related to this change, but something to note.

  • canReserveResources and reserveResources do not align in the way they work with ProcResourceCount. This may cause a compiler crash.

Our downstream implementation uses a custom implementation for resource allocation, so this hasn't been a major issue.

  • Calculation of latency between two instructions connected through a PHI does not involve scheduling info; the computed latency is always one or zero (don't remember which one exactly).

This was the issue I commented on in both of your comments re: including the back edge of the circuit in the Latency calculations for NodeSet. Our backend 'skips' the PHI and calculates the operand latency across the backedge to the true use.

  • Finally, the schedule the pipeliner carefully crafted is destroyed by the subsequent MI scheduler, which makes it less useful for VLIW in-order targets.

We recently resolved this by implementing a post-expansion insertion of scheduling barriers. Each modulo cycle is considered its own region and is, therefore, not reordered. There's some magic in the post-RA scheduler that undoes this so that we can actually schedule the whole kernel, but I digress.

There is a room for improvement, but thanks-to-out-of-order-execution few people care.

A workaround for your issue is simple - you just don't call calculateRecMII and estimate MII yourself. You may as well force it to 1 and it will be safe because MII is just a hint for the scheduler. This will affect compilation time, of course.

It's a good workaround for sure, but I believe this is a viable and correct fix to a real bug that may impact non-Hexagon users. I am, however, willing to reconsider if there is real issue with the change.

Excuse my English

Honestly? I didn't even notice :)

This was the issue I commented on in both of your comments re: including the back edge of the circuit in the Latency calculations for NodeSet. Our backend 'skips' the PHI and calculates the operand latency across the backedge to the true use.

I think I was talking about something different, most probably updatePhiDependences.
As you can see there, the Anti dependence created in this function gets the latency 1, not considering the latency of the instruction which feeds the PHI. Similarly, instructions which depend on PHIs get latency 0. This roughly means that the latency computed between the real instructions connected through a PHI node is 1, which in many cases far from accurate.

We recently resolved this by implementing a post-expansion insertion of scheduling barriers. Each modulo cycle is considered its own region and is, therefore, not reordered. There's some magic in the post-RA scheduler that undoes this so that we can actually schedule the whole kernel, but I digress.

Interesting approach, thanks for sharing this! We've been thinking about storing the computed schedule somewhere outside the pipeliner pass (e.g. metadata), disable both pre- and post-RA schedulers for pipelined loops and then use the recorded schedule post-RA to form instruction bundles. But we've never been able to resolve issues with copies / spills inserted by the register allocator. They are inserted in-between scheduled "regions" which adds extra cycles. While it is possible to try to fold the inserted instructions into the nearest bundle, in general it would require at least partial rescheduling to avoid hazards / stalls. Don't know how it all ended, it has been some time since I left the project.

It's a good workaround for sure, but I believe this is a viable and correct fix to a real bug that may impact non-Hexagon users. I am, however, willing to reconsider if there is real issue with the change.

I'll try to give it a closer look.

Excuse my English

Honestly? I didn't even notice :)

Oh, really? This is inspiring to say the least, thank you!

hgreving added a comment.EditedJul 19 2022, 12:16 PM

This was the issue I commented on in both of your comments re: including the back edge of the circuit in the Latency calculations for NodeSet. Our backend 'skips' the PHI and calculates the operand latency across the backedge to the true use.

I think I was talking about something different, most probably updatePhiDependences.
As you can see there, the Anti dependence created in this function gets the latency 1, not considering the latency of the instruction which feeds the PHI. Similarly, instructions which depend on PHIs get latency 0. This roughly means that the latency computed between the real instructions connected through a PHI node is 1, which in many cases far from accurate.

We recently resolved this by implementing a post-expansion insertion of scheduling barriers. Each modulo cycle is considered its own region and is, therefore, not reordered. There's some magic in the post-RA scheduler that undoes this so that we can actually schedule the whole kernel, but I digress.

We recently resolved this by implementing a post-expansion insertion of scheduling barriers. Each modulo cycle is considered its own region and is, therefore, not reordered. There's some magic in the post-RA scheduler that undoes this so that we can actually schedule the whole kernel, but I digress.

Ideally (I am assuming you're working on a VLIW architecture) LLVM should support bundling pre-RA, until then, a downstream target needs to use workarounds like this (we're doing something similar). Both pre-RA and post-RA scheduling doesn't make sense to run on blocks that were pipelined.

Interesting approach, thanks for sharing this! We've been thinking about storing the computed schedule somewhere outside the pipeliner pass (e.g. metadata), disable both pre- and post-RA schedulers for pipelined loops and then use the recorded schedule post-RA to form instruction bundles. But we've never been able to resolve issues with copies / spills inserted by the register allocator. They are inserted in-between scheduled "regions" which adds extra cycles. While it is possible to try to fold the inserted instructions into the nearest bundle, in general it would require at least partial rescheduling to avoid hazards / stalls. Don't know how it all ended, it has been some time since I left the project.

It's a good workaround for sure, but I believe this is a viable and correct fix to a real bug that may impact non-Hexagon users. I am, however, willing to reconsider if there is real issue with the change.

I'll try to give it a closer look.

I think the fix is prob good as long as the assumption of the path order of the Johnson algorithm is correct. It's also better to underestimate RecII rather than over estimate. Important to notice that LLVM's upstream implementation of finding the cycle length includes the nodes that are part of the recurrence, because swing scheduling relies on those sets to be prioritized. Other pipeline algorithms only require the length of a recurrence but not the nodes that are part of it. Finding the MII for recurrence can be implemented much faster with a compute minimum distance matrix than the upstream implementation (though again, swing does need it).

Excuse my English

Honestly? I didn't even notice :)

Oh, really? This is inspiring to say the least, thank you!

JamesNagurne added inline comments.Jul 19 2022, 1:39 PM
llvm/include/llvm/CodeGen/MachinePipeliner.h
351

I couldn't find any specific "It returns this order", but there are a number of hints:

I'm using only a copy of the paper I found at https://www.cs.tufts.edu/comp/150GA/homeworks/hw1/Johnson%2075.PDF

First, definitions:

  • A path is a defined as a set of directed jumps from one node to an adjacent node in a graph
  • A circuit is a path whose start and end nodes are the same
  • An elementary circuit is a circuit which never jumps from the same node twice.
  • Johnson's circuit algorithm finds elementary circuits

The algorithm uses the concept of a 'root vertex s', which is defined as a least vertex through some definition of lesser/greater. In the code, this ordering is defined by SUnit's NodeNum field, which is a good approximation for ordering.

The algorithm then repeatedly appends to the path from s to some adjacent node n, and then uses n as the starting point for the next iteration of the algorithm. Since each step jumps from a previously selected, adjacent node to the next, I don't see how it could be out of order. This is further reinforced because each node is guaranteed to be "greater" than the last.

Added comment detailing the expectation of iterators 'S' and 'E' to the Latency calculations.

It's interesting to note that a NodeSet is manipulated quite often after construction, such as when accumulating instructions not part of a circuit. It's only when calculation Latency that the ordering of the circuit must hold.

barannikov88 accepted this revision.Jul 20 2022, 12:03 AM

LGTM
Thank you for the detailed explanations here and on discourse, it made the issue clear and the review much easier.
The solution seems right, though it might be fragile as it relies on a particular implementation of circuits-finding algorithm.
One way to make it more reliable is to move the latency calculation into a private method SwingSchedulerDAG::Circuits, leaving the constructor of NodeSet trivial.

llvm/include/llvm/CodeGen/MachinePipeliner.h
351

is a bit shorter and follows the general "modulo" idea :)
Not a strong preference, of course.

This revision is now accepted and ready to land.Jul 20 2022, 12:03 AM
hgreving accepted this revision.Jul 20 2022, 7:56 AM

LGTM

llvm/include/llvm/CodeGen/MachinePipeliner.h
359

You could add an assertion that at least one of the successors should get here and we should never skip all successors, which would imply that sth with the nodes order is unexpected.

JamesNagurne marked an inline comment as done.

Apologies for "one last" revision, but I wanted to make sure the review reflected my intended commit.

I've incorporated the "in-spirit" modulo operation, as well as a check to ensure the circuit path is in order.

I think there is some very good merit in calculating Latency as part of the ::circuit algorithm, but did not want to bite that off just yet.

I think there is some very good merit in calculating Latency as part of the ::circuit algorithm, but did not want to bite that off just yet.

Fine with me, please proceed.

hgreving accepted this revision.Jul 21 2022, 2:01 PM
JamesNagurne added inline comments.Jul 21 2022, 3:28 PM
llvm/include/llvm/CodeGen/MachinePipeliner.h
359

@hgreving Thank you for planting the seed that I should be checking my assumption more rigorously.
It is indeed the case that the algorithm doesn't find a backedge successor in a given elementary circuit.

The upstream pipeliner has '::addLoopCarriedDependences', which creates "chain edges" (which I can't find a good def. for) from a load to a store.

These edges are specifically called out by ::createAdjacencyStructure and added to the structure. This results in a potential elementary circuit:

Load -> Op -> Store
  \-------------^

Note that there is no edge from Store to Load, only a special Order-Barrier edge from Load to Store (see ::isLoopCarriedDep).

I now believe there are 2 errors:

  1. The original, where the path isn't being followed and extra edges are being added to the Latency calculation. I've fixed this.
  2. The just-uncovered error, where there exists some backedges in a node's predecessor list, which is not being checked at all

My suggestion is to commit upstream the fix for 1, and then table a fix for 2 until I have more time to look into it.
That is, I will no longer assert MaxEdge, and will instead add 0, as the algorithm would have before my update. I will make a comment detailing this analysis as well.

barannikov88 added inline comments.Jul 21 2022, 5:00 PM
llvm/include/llvm/CodeGen/MachinePipeliner.h
359

@JamesNagurne

The upstream pipeliner has '::addLoopCarriedDependences', which creates "chain edges" (which I can't find a good def. for) from a load to a store.

In case that helps:

LoopBB:
  %1 = PHI %2, LoopBB, ...
  %i = PHI %i, LoopBB, %i_init, PHBB
       ST %1, [%ptr + %i + 1]
  %2 = LD [%ptr + %i]
  %i = ADD %i, 1

In the initial DAG (built by buildSchedGraph) there is Anti dependence LD -> PHI (PHI depends on LD) and Data dependence ST -> PHI (ST depends on PHI).
Note that the direction of dependency between LD and PHI is reversed. This is due to the DAG nature - it cannot contain cycles, because, well, 'A' stands for Acyclic.

Note that LD and ST access different memory locations, so there are independent of each other. Also note that the store uses the value defined by the load on the previous iteration. From the point of view, they are dependent. This dependency is called "loop-carried".

The loop-carried dependency does not exist in the initial DAG as it is not needed for the main user of buildSchedGraph -- MachineScheduler -- it can freely reorder these operations because there are still virtual registers. The worst case scenario here is that register allocator will have to insert a COPY instruction due to overlapping lifetimes of %1 and %2.

But this kind of dependence is important for pipeliner for reasons I don't clearly remember. So it adds an edge between the load and the store so that they are not reordered, but only if it could not prove that the two operations from the current and the next iteration do not alias (in the above example they do).

The "chain" here has exact same meaning as in SelectionDAG -- it enforces order between two nodes which are otherwise independent; it effectively is an "order" dependence, with "scheduling barrier" flavor (see enum OrderKind).

If you are interested, I could try to recall why adding this edge is necessary.

barannikov88 accepted this revision.Jul 21 2022, 6:47 PM
hgreving added inline comments.Jul 21 2022, 7:24 PM
llvm/include/llvm/CodeGen/MachinePipeliner.h
359

@hgreving Thank you for planting the seed that I should be checking my assumption more rigorously.
It is indeed the case that the algorithm doesn't find a backedge successor in a given elementary circuit.

The upstream pipeliner has '::addLoopCarriedDependences', which creates "chain edges" (which I can't find a good def. for) from a load to a store.

These edges are specifically called out by ::createAdjacencyStructure and added to the structure. This results in a potential elementary circuit:

Load -> Op -> Store
  \-------------^

Note that there is no edge from Store to Load, only a special Order-Barrier edge from Load to Store (see ::isLoopCarriedDep).

I now believe there are 2 errors:

  1. The original, where the path isn't being followed and extra edges are being added to the Latency calculation. I've fixed this.
  2. The just-uncovered error, where there exists some backedges in a node's predecessor list, which is not being checked at all

My suggestion is to commit upstream the fix for 1, and then table a fix for 2 until I have more time to look into it.
That is, I will no longer assert MaxEdge, and will instead add 0, as the algorithm would have before my update. I will make a comment detailing this analysis as well.

359

@hgreving Thank you for planting the seed that I should be checking my assumption more rigorously.
It is indeed the case that the algorithm doesn't find a backedge successor in a given elementary circuit.

The upstream pipeliner has '::addLoopCarriedDependences', which creates "chain edges" (which I can't find a good def. for) from a load to a store.

These edges are specifically called out by ::createAdjacencyStructure and added to the structure. This results in a potential elementary circuit:

Load -> Op -> Store
  \-------------^

Note that there is no edge from Store to Load, only a special Order-Barrier edge from Load to Store (see ::isLoopCarriedDep).

I now believe there are 2 errors:

  1. The original, where the path isn't being followed and extra edges are being added to the Latency calculation. I've fixed this.
  2. The just-uncovered error, where there exists some backedges in a node's predecessor list, which is not being checked at all

My suggestion is to commit upstream the fix for 1, and then table a fix for 2 until I have more time to look into it.
That is, I will no longer assert MaxEdge, and will instead add 0, as the algorithm would have before my update. I will make a comment detailing this analysis as well.

359

Are you saying that Johnson did not return the nodes of the circuit in the order taken but the edge is in the Nodes list? I think if it generally does, minus some rare cases, it's still ok, since it is ok to undercount once in a while. Especially for small loops you don't want to over estimate the recurrence II.

That is, I will no longer assert MaxEdge, and will instead add 0, as the algorithm would have before my update

Didn't follow this one, how would your new code look like?

By the way, I would disregard the kind of edges and what they mean in a particular case, since every target can do its own thing, in particular with respect to its memory model. I would focus on making sure that the patch doesn't break recurrence finding for obvious cases causing compile time increase, is what I was worried about.

JamesNagurne added inline comments.Jul 21 2022, 7:45 PM
llvm/include/llvm/CodeGen/MachinePipeliner.h
359

Johnson's algorithm did return a circuit, but that circuit is not identified by the NodeSet Latency constructor. This 'chain' edge is not reversed like anti-edges are, so it's not in the Succs list.

The constructor would have to additionally check all predecessors for these special Load -> Store edges the pipeliner creates to find that extra cycle.

I agree (and so do my colleagues) that this is not a big deal. Being too low is much better than being too high.

Didn't follow this one, how would your new code look like?

-     assert(MaxEdge != nullptr && "Expected path successor in circuit");
-     Latency += MaxEdge->getLatency();
+     Latency += ((MaxEdge == nullptr) ? 0 : MaxEdge->getLatency());

With an added FIXME/comment describing this conversation.