This is an archive of the discontinued LLVM Phabricator instance.

[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.
ClosedPublic

Authored by andreadb on Jun 19 2019, 5:00 AM.

Details

Summary

This patch teaches the bottleneck analysis how to identify and print the most
expensive sequence of instructions according to the simulation. Fixes PR37494.

The goal is to help users identify the sequence of instruction which is most
critical for performance.

A dependency graph is internally used by the bottleneck analysis to describe data
dependencies and processor resource interferences between instructions.

There is one node in the graph for every instruction in the input assembly
sequence. The number of nodes in the graph is independent from the number of
iterations simulated by the tool. It means that a single node of the graph
represents all the possible instances of a same instruction contributed by
the simulated iterations.

Edges are dynamically "discovered" by the bottleneck analysis by observing
instruction state transitions and "backend pressure increase" events generated
by the Execute stage. Information from the events is used to identify critical
dependencies, and materialize edges in the graph. A dependency edge is uniquely
identified by a pair of node identifiers plus an instance of struct
DependencyEdge::Dependency (which provides more details about the
actual dependency kind).

The bottleneck analysis internally ranks dependency edges based on their impact
on the runtime (see field DependencyEdge::Dependency::Cost). To this end, each
edge of the graph has an associated cost. By default, the cost of an edge is a
function of its latency (in cycles). In practice, the cost of an edge is also a
function of the number of cycles where the dependency has been seen as
'contributing to backend pressure increases'. The idea is that the higher the
cost of an edge, the higher is the impact of the dependency on performance. To put
it in another way, the cost of an edge is a measure of criticality for performance.

Note how a same edge may be found in multiple iteration of the
simulated loop. The logic that adds new edges to the graph checks if an equivalent
dependency already exists (duplicate edges are not allowed). If an equivalent
dependency edge is found, field DependencyEdge::Frequency of that edge is
incremented by one, and the new cost is cumulatively added to the existing edge cost.

At the end of simulation, costs are propagated to nodes through the edges of the
graph. The goal is to identify a critical sequence from a node of the root-set
(composed by node of the graph with no predecessors) to a 'sink node' with no
successors.
Note that the graph is intentionally kept acyclic to minimize the complexity of
the critical sequence computation algorithm (complexity is currently linear in the
number of the graph).

The critical path is finally computed as a sequence of dependency edges. For edges
describing processor resource interferences, the view also prints a
so-called "interference probability" value (by dividing field DependencyEdge::Frequency
by the total number of iterations).

Examples of critical sequence computations can be found in tests added/modified
by this patch.

On output streams that support colored output, instructions from the critical
sequence are rendered with a different color.

Strictly speaking the analysis conducted by the bottleneck analysis view is not
a critical path analysis. The cost of an edge doesn't only depend on the
dependency latency. More importantly, the cost of a same edge may be computed
differently by different iterations.

The number of dependencies is discovered dynamically based on the events
generated by the simulator. However, their number is not fixed. This is
especially true for edges that model processor resource interferences; an
interference may not occur in every iteration. For that reason, it makes
sense to also print out a "probability of interference".

By construction, the accuracy of this analysis (as always) is strongly dependent
on the simulation (and therefore the quality of the information available in the
scheduling model).

That being said, the critical sequence effectively identifies a performance
criticality. Instructions from that sequence are expected to have a very big
impact on performance. So, users can take advantage of this information to focus
their attention on specific interactions between instructions.
In my experience, it works quite well in practice, and produces useful
output (in a reasonable amount time). So, I'd like to know your opinion on this.

-Andrea

Diff Detail

Event Timeline

andreadb created this revision.Jun 19 2019, 5:00 AM

Nice! Thank you for working on this!

A few minors

test/tools/llvm-mca/X86/SkylakeClient/bottleneck-analysis.s
60

(very pedantic) Fix the Dependency Information column indentation to be consistent?

tools/llvm-mca/Views/BottleneckAnalysis.cpp
185

Comments describing whats going on?

210

Comments?

tools/llvm-mca/Views/BottleneckAnalysis.h
190–191

A lot of this is independent of the patch - precommit cleanup?

andreadb marked 3 inline comments as done.Jun 19 2019, 6:17 AM
andreadb added inline comments.
test/tools/llvm-mca/X86/SkylakeClient/bottleneck-analysis.s
60

Very strange.

I suspect it probably has to do with the fact that update_mca_check script might have replaced tabs with spaces.
MCInstPrinter inserts tabs between opcode and operand list. Those don't seem to be preserved by the check script, so you end up with that odd check line.
When printing the dependency information, I already "pad-to-column". So, to me that must have to do with the script that generates check lines.
(It appears correctly aligned on my terminal).

tools/llvm-mca/Views/BottleneckAnalysis.cpp
210

I will add a description of the algorithm here.

tools/llvm-mca/Views/BottleneckAnalysis.h
190–191

Sorry. The data dependency graph was pre-committed to "simplify" this patch.
In retrospect I am not sure if that was a wise choice. So, apologies in case.

I guess, the least that I can do is to add more code comments. Let me know...

ormris added a subscriber: ormris.Jun 19 2019, 11:04 AM
mattd accepted this revision.Jun 19 2019, 10:02 PM

Awesome patch. I'm cool with this as long as Simon's comments are addressed.

tools/llvm-mca/Views/BottleneckAnalysis.cpp
174

Suggestion:

for (unsigned I : llvm::seq<unsigned>(0, Nodes.size())) {...}
351

Perhaps just "Unsupported dependency type!" Is there a case where we have an expected unsupported dependency type? If so then ignore my suggestion :)

This revision is now accepted and ready to land.Jun 19 2019, 10:02 PM

Awesome patch. I'm cool with this as long as Simon's comments are addressed.

Thanks Matt for the review.

I will soon send an updated patch that addresses Simon's comments.
I plan to wait for a few days to give more time to review the patch and collect extra feedback (in case).

I forgot to mention in the summary that this patch doesn't include an llvm-documentation change.
The new bottleneck analysis probably requires its own sub-section in the llvm-mca documentation.

If reviewers are okay with it, I can do the documentation change in a follow-up patch.
In the meantime, if people want, I can raise a bug to add the bottleneck analysis description in the docs.

Docs additions as a follow up patch is fine.

andreadb updated this revision to Diff 205848.Jun 20 2019, 9:51 AM
andreadb marked 3 inline comments as done.

Patch updated.

Addressed review comments.
Added a description of the algorithm.
Improved description of the bottleneck analysis view (I plan to reuse most of it in a follow-up patch that improves the docs).

RKSimon accepted this revision.Jun 20 2019, 12:58 PM

awesome - LGTM

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2019, 6:32 AM