Page MenuHomePhabricator

[LoopFission]: Loop Fission Interference Graph (FIG)
Needs ReviewPublic

Authored by etiotto on Jan 31 2020, 12:43 PM.

Details

Summary

SUMMARY
The Loop Fission Interference Graph (FIG) is generated with the nodes of the data dependence graph (DDG). Weights are generated for the edges of the interference graph reflecting the affinity between statements represented by the nodes joined by the edges. Nodes in the interference graph are given weights reflecting resource usage by the statements associated with the nodes. The interference graph is partitioned using a profitability test based on the weights of edges and nodes in the data dependence graph. Code (e.g. loops) is emitted based on the partitioned interference graph.

PATCH DESCRIPTION
In this first patch we introduce the basic components of the Loop Fission Interference Graph:

  • the FIGNode (and derived class) representing nodes in the graph
  • and the FIGEdge (and its derived classes) representing edges in the graph

The graph is constructed based on the DDG for a loop. Each interference graph node encapsulates one or more DDG nodes.
Edges connect the nodes in the interference and are used to enforce data dependencies.
A unit test is provided to test the graph structure and ensure nodes are connected properly.

FUTURE PATCHES
In future patches we plan to:

  • add the ability to merge nodes in the interference graph
  • summarize node connectivity informations and use the the information to prevent merging nodes that would create cycles in the graph

Merged nodes will represent the largest unit of code that can be distributed into a loop without violating data dependencies.

Afterward we will start to add to the graph the infrastructure needed to develop a cost model which will be used to determine which nodes should be merged to achieve data reuse.
Nodes found to exhibit data reuse will be connected by affinity edges with a weight that represent the desirability to merge them.
Nodes will also be given a set of Resources to represent the computational resource the code in the node consumes (e.g. code size, HW streams, etc...).

Diff Detail

Event Timeline

etiotto created this revision.Jan 31 2020, 12:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 31 2020, 12:43 PM
etiotto updated this revision to Diff 241797.Jan 31 2020, 1:05 PM

[not a review] The established term in the code base for the transformation is (Loop-)Distribution. Could we use the same term here?

I choose to name the pass Loop Fission to avoid confusion with the existing LoopDistribution pass (eventually Loop Fission should replace that pass). Also Loop Fission is the opposite of Loop Fusion :-) ?

The FIG recreate a dependency graph from the DDG. I strongly suggest to traverse the nodes/edges from the DDG instead of iterating over all pairs of nodes and asking the DDG 'does this edge exist?' That's supposed to be done just once (by the DDG) and I hope it can be optimized later to reduce the burden of O(n^2). Maybe even derive from the DDG and only change the parts that are difference in the FIG.

Otherwise, I don't see that many differences between the DDG and FIG with a subset of dependencies (and later added affinity 'dependencies'). Why can other dependency (DDGEdge::Unknown) just be ignored?

llvm/include/llvm/Analysis/DDG.h
450–451

[style] LLVM does not use almost-always-auto

llvm/include/llvm/Transforms/Scalar/LoopFission/FIG.h
113

AFICS each FIG node is backed by exactly one DDGNode. Why is it a SmallPtrSet?

137

[typo] Intereference

222

[serious] With "update", I'd expect Weight = W (and maybe some updating logic, it would be setWeight otherwise). Suggestion: "addWeight".

408

[typo] grapth (copy-and-pasted from DDG.h)

451

[typo] grapth

llvm/lib/Transforms/Scalar/CMakeLists.txt
31

[serious] It is unusual to place component files into subdirectory and I don't see a reason to deviate from established practice. Maybe put it into the LLVMAnalysis like the DDG?

llvm/lib/Transforms/Scalar/LoopFission/FIG.cpp
22

In release mode, I get a warning warning: ‘VerboseDebug’ defined but not used.

106

[serious] This calls DependenceInfo::depends (a really expensive call) again on each edge after DDG already has done so. Maybe cache in the DDG?

108

[serious] I think being an input dependence does not, in principle, exclude also being another type of dependence (e.g. a dependence between function calls that read and write (conditionally) a location). Maybe isOrdered() is what you are looking for.

IIUC, DependenceInfo does not even report a dependence if it is input-only (and make me wonder why isInput even exists).

124

This is weird: ~FIGEdge is defined as abstract (=0), but there is an implementation here. Seems to work in that it makes the base class abstract, but still allows derived classed to call the dtor.

https://stackoverflow.com/questions/24316700/c-abstract-class-destructor#answer-24317666

139

[suggestion] Also assert that no Weight is set for the other types.

318

[style] constexpr?

Suggestion: haveMemoryDependence(N1, N2, DDG, /*SkipInputDependencies=*/true)

424–425

[serious] This is at least O(n^2)

This iterates over all pairs of nodes
  then over all outgoing edges of the src node
    to find a match in the underlying DDG
    and potentially calls `getDependencies` for each edge

For nodes with multiple outgoing edges (=d), this calls getDependencies O(d^2) times on the same edge for some information that the DDG already queried.

To summarize, this code does not scale.

531

[style] RN is unused; use isa<RootFIGNode>

Do you plan to support lexical forward dependencies?

fhahn added a comment.Feb 18 2020, 4:07 AM

I think it would be great if the patch that uses the infrastructure added here would also be available for review, to get a better idea how it is actually used. It also makes testing much easier.

I choose to name the pass Loop Fission to avoid confusion with the existing LoopDistribution pass (eventually Loop Fission should replace that pass). Also Loop Fission is the opposite of Loop Fusion :-) ?

I think adding this as separate pass with a slightly different name also causes a bunch of confusion and potentially split the development effort, in case people improve the existing LoopDistribute. Is there a reason for adding a completely new pass, rather than working on improving/replacing the existing LoopDistribute? I am a bit worried that in the end we end up with 2 passes that are not used/enabled. Would it not be possible/easier to add an option to use the FIG for partition in LoopDistribute and incrementally improve LoopDistribute?

Meinersbur added inline comments.Feb 26 2020, 8:59 AM
llvm/lib/Transforms/Scalar/LoopFission/FIG.cpp
108

Seems I was wrong here and DependenceInfo::depends does return input dependencies. The check is only !(Src->mayReadOrWriteMemory() && Dst->mayReadOrWriteMemory()), i.e. whether some memory is accesses at all, not what kind of access it is.

bmahjour added inline comments.Feb 26 2020, 9:48 AM
llvm/lib/Transforms/Scalar/LoopFission/FIG.cpp
106

The DDG would have to search through every pair of instructions in the loop. The set of instructions being iterated here is much more reduced than what the DDG goes through, because we only consider instructions that actually have a memory dependence between them (previously detected). On the other hand, depends is more expensive to compute when there is actual memory dependencies so the concern you raise is still valid.

When designing the DDG we opted for lazy evaluation of dependencies instead of caching the results due to memory consumption concerns. I think we should revisit these trade offs at some point. I'm just not sure this is the right time, given that the algorithms and the implementation is being changed, so any data we collect could go stale and become irrelevant very quickly.

Question regarding input dependencies inherited from the DDG:

for (int i = 0; i < b; ++i) {
  B[i] = f(A[i-1] + A[i]);
  C[i] = g(A[i]);
}

IIUC, the DDG will contain the input dependencies B->C and C->B (loop-carried). Since this results in a cycle, those will be combined in a Pi-node. The FIG treats Pi-nodes atomically, hence will not be able to fission/distribute the loop. Is this correct? Having read-read dependencies in the DDG still feels strange to me.

llvm/lib/Transforms/Scalar/LoopFission/FIG.cpp
106

Did you consider adding a TODO in the source about low hanging fruits to improve performance?

Question regarding input dependencies inherited from the DDG:

for (int i = 0; i < b; ++i) {
  B[i] = f(A[i-1] + A[i]);
  C[i] = g(A[i]);
}

IIUC, the DDG will contain the input dependencies B->C and C->B (loop-carried). Since this results in a cycle, those will be combined in a Pi-node. The FIG treats Pi-nodes atomically, hence will not be able to fission/distribute the loop. Is this correct? Having read-read dependencies in the DDG still feels strange to me.

The dependence graph builder only does edge reversal on ordered dependencies, and since input dependencies are not ordered the above example will not cause a cycle since all input dependence edges flow from the first statement to the second (there will be no back edge).
While I think input dependencies can cause cycles in theory, I don't think they'll be common in our LLVM implementation because the dependencies are modeled at a higher granularity than statement level. I tried to engineer an example that would introduce a cycle with input dependency being a critical edge, but I failed. If you have such an example it would be interesting to look at.
Input dependencies may be uninteresting to loop fission (at least for enforcing legality constraints) but would be useful in other cases such as modeling data reuse. In fact loop fission could also use those edges to establish weights on the affinity edges. For that reason I'd argue there is value in keeping them in the DDG, even if it means giving up on some rare opportunities.
Another possibility might be to have two flavors of the DDG; one that models the input dependencies and one that doesn't, assuming that a motivating case for loop fission can be found.

Do you plan to support lexical forward dependencies?

Those should already be covered by the DDG.

I tried to engineer an example that would introduce a cycle with input dependency being a critical edge, but I failed. If you have such an example it would be interesting to look at.

for (int i = ...) {
  p = A[i-1];
  x = load *p;    

  y = load *q;
  A[i] = y;
}

Lets assume AA knows that A and {q,p} do not alias (e.g. TBAA). It is legal to distribute to

for (int i = ...) {
  y = load *q;
  A[i] = y;
}
for (int i = ...) {
  p = A[i-1];
  x = load *p;    
}

There is a dependency chain y = load *q; ->A[i] = y; -> p = A[i-1]; -> x = load *p;. The input dependency x = load *p; -> y = load *q; closes the cycle. It should be even easier to do with instructions that read and write at the same time (e.g. call to memcpy), but DependencyInfo does not support them.

What's even weirder is that this depends on the instruction order on the IR. If we reorder x = load *p and y = load *q, and (IIUC) DDG only adds unordered (= input) dependencies in forward direction, we get the dependency y = load *q; -> x = load *p;, which is transitively redundant.

I strongly feel that input dependencies should not be part of the DDG.