[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.

This patch teaches the bottleneck analysis how to identify and print the most
expensive sequence of instructions according to the simulation. Fixes PR37494.

The goal is to help users identify the sequence of instructions that is most
critical for performance.

A dependency graph is internally used by the bottleneck analysis to describe
data dependencies and processor resource interferences between instructions.
There is one node in the graph for every instruction in the input assembly
sequence. The number of nodes in the graph is independent of the number of
iterations simulated by the tool. This means that a single node of the graph
represents all the possible instances of the same instruction contributed by
the simulated iterations.

Edges are dynamically "discovered" by the bottleneck analysis by observing
instruction state transitions and "backend pressure increase" events generated
by the Execute stage. Information from the events is used to identify critical
dependencies, and materialize edges in the graph. A dependency edge is uniquely
identified by a pair of node identifiers plus an instance of struct
DependencyEdge::Dependency (which provides more details about the actual
dependency kind).

The bottleneck analysis internally ranks dependency edges based on their impact
on the runtime (see field DependencyEdge::Dependency::Cost). To this end, each
edge of the graph has an associated cost. By default, the cost of an edge is a
function of its latency (in cycles). In practice, the cost of an edge is also a
function of the number of cycles where the dependency has been seen as
'contributing to backend pressure increases'. The idea is that the higher the
cost of an edge, the higher the impact of the dependency on performance. To
put it another way, the cost of an edge is a measure of criticality for
performance.

Note that the same edge may be found in multiple iterations of the simulated
loop. The logic that adds new edges to the graph checks if an equivalent
dependency already exists (duplicate edges are not allowed). If an equivalent
dependency edge is found, field DependencyEdge::Frequency of that edge is
incremented by one, and the new cost is cumulatively added to the existing edge
cost.
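
The merge rule described above can be sketched as follows. This is only an
illustration, not the code from the patch: the types SketchDependency,
SketchEdge and SketchGraph, the field ResourceOrRegID, and the choice of
starting Frequency at 1 for a new edge are all assumptions made for the
example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the structures named in the text.
enum class DepKind { Register, Memory, Resource };

struct SketchDependency {
  DepKind Kind;
  uint64_t ResourceOrRegID; // Assumed: what the dependency is on.
  uint64_t Cost;            // Cost contributed by one observation.
};

struct SketchEdge {
  unsigned From;
  unsigned To;
  SketchDependency Dep;
  unsigned Frequency; // Number of times this edge has been observed.
};

struct SketchGraph {
  std::vector<SketchEdge> Edges;

  // Adds a dependency From -> To. If an equivalent edge already exists,
  // bump its frequency and accumulate the cost instead of duplicating it.
  void addDependency(unsigned From, unsigned To, const SketchDependency &Dep) {
    assert(From != To && "self-edges are not expected in this sketch");
    for (SketchEdge &E : Edges) {
      bool Equivalent = E.From == From && E.To == To &&
                        E.Dep.Kind == Dep.Kind &&
                        E.Dep.ResourceOrRegID == Dep.ResourceOrRegID;
      if (Equivalent) {
        ++E.Frequency;
        E.Dep.Cost += Dep.Cost;
        return;
      }
    }
    // Assumption: a newly created edge starts with Frequency 1.
    Edges.push_back(SketchEdge{From, To, Dep, 1});
  }
};
```

With this rule, observing the same register dependency in two iterations
yields a single edge whose cost is the sum of the two observed costs.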

At the end of the simulation, costs are propagated to nodes through the edges
of the graph. The goal is to identify a critical sequence from a node of the
root-set (composed of the nodes of the graph with no predecessors) to a 'sink
node' with no successors. Note that the graph is intentionally kept acyclic to
minimize the complexity of the critical sequence computation algorithm
(complexity is currently linear in the number of nodes in the graph).
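
One way to picture the propagation step is a longest-path style walk over the
acyclic graph. The sketch below is not the algorithm from the patch; it assumes
(for illustration only) that nodes are numbered in program order and that every
edge goes from a lower to a higher node index, which guarantees acyclicity. The
names PathEdge and criticalSequence are hypothetical. This version visits every
node and every edge once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical edge type: From/To are node indices, Cost is the accumulated
// edge cost.
struct PathEdge {
  unsigned From;
  unsigned To;
  uint64_t Cost;
};

// Returns the critical sequence as indices into Edges, ordered from a root
// node to a sink node.
std::vector<size_t> criticalSequence(unsigned NumNodes,
                                     const std::vector<PathEdge> &Edges) {
  const size_t None = static_cast<size_t>(-1);
  std::vector<std::vector<size_t>> Out(NumNodes);
  for (size_t I = 0; I < Edges.size(); ++I) {
    // Assumed invariant that keeps the graph acyclic.
    assert(Edges[I].From < Edges[I].To && Edges[I].To < NumNodes);
    Out[Edges[I].From].push_back(I);
  }

  // Propagate costs in node order; remember the most expensive incoming edge.
  std::vector<uint64_t> NodeCost(NumNodes, 0);
  std::vector<size_t> CriticalPred(NumNodes, None);
  for (unsigned N = 0; N < NumNodes; ++N) {
    for (size_t I : Out[N]) {
      const PathEdge &E = Edges[I];
      uint64_t Candidate = NodeCost[N] + E.Cost;
      if (Candidate > NodeCost[E.To]) {
        NodeCost[E.To] = Candidate;
        CriticalPred[E.To] = I;
      }
    }
  }

  // Pick the sink node (no successors) with the highest propagated cost.
  unsigned Best = NumNodes;
  for (unsigned N = 0; N < NumNodes; ++N) {
    if (!Out[N].empty())
      continue;
    if (Best == NumNodes || NodeCost[N] > NodeCost[Best])
      Best = N;
  }

  // Walk back from the sink to a root, then reverse into root-to-sink order.
  std::vector<size_t> Sequence;
  if (Best == NumNodes)
    return Sequence; // Empty graph.
  for (size_t I = CriticalPred[Best]; I != None;
       I = CriticalPred[Edges[I].From])
    Sequence.push_back(I);
  std::reverse(Sequence.begin(), Sequence.end());
  return Sequence;
}
```

For a diamond-shaped graph where the path through node 2 is more expensive than
the path through node 1, the returned sequence contains only the two edges on
the path through node 2.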

The critical path is finally computed as a sequence of dependency edges. For
edges describing processor resource interferences, the view also prints a
so-called "interference probability" value (computed by dividing field
DependencyEdge::Frequency by the total number of iterations).
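
As a small worked example of that ratio, the helper below (a hypothetical
function, not part of the patch) divides the edge frequency by the number of
simulated iterations. How the view formats the resulting value is not shown
here.

```cpp
#include <cassert>

// Hypothetical helper: fraction of simulated iterations in which a resource
// interference edge was observed.
double interferenceProbability(unsigned Frequency, unsigned Iterations) {
  assert(Iterations > 0 && "expected at least one simulated iteration");
  return static_cast<double>(Frequency) / Iterations;
}
```

For instance, an edge seen in 25 out of 100 iterations has an interference
probability of 0.25.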

Examples of critical sequence computations can be found in tests added/modified
by this patch.

On output streams that support colored output, instructions from the critical
sequence are rendered with a different color.

Strictly speaking, the analysis conducted by the bottleneck analysis view is not
a critical path analysis. The cost of an edge doesn't only depend on the
dependency latency. More importantly, the cost of the same edge may be computed
differently by different iterations.

Dependencies are discovered dynamically based on the events generated by the
simulator, so their number is not fixed. This is especially true for edges that
model processor resource interferences; an interference may not occur in every
iteration. For that reason, it makes sense to also print out a "probability of
interference".

By construction, the accuracy of this analysis (as always) is strongly dependent
on the simulation (and therefore on the quality of the information available in
the scheduling model).

That being said, the critical sequence effectively identifies a performance
criticality. Instructions from that sequence are expected to have a large
impact on performance. So, users can take advantage of this information to focus
their attention on specific interactions between instructions.

In my experience, it works quite well in practice, and produces useful
output (in a reasonable amount of time).

Differential Revision: https://reviews.llvm.org/D63543