This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] Opcode stabilization / reclusterization (PR40715)
ClosedPublic

Authored by lebedev.ri on Feb 18 2019, 8:04 AM.

Details

Summary

Given an instruction Opcode, we can make benchmarks (measurements) of the
instruction characteristics/performance. Then, to facilitate further analysis
we group the benchmarks with *similar* characteristics into clusters.
However, none of this is entirely deterministic. Some instructions have variable
characteristics, depending on their arguments. Thus, if we do several
benchmarks of the same instruction Opcode, we may end up with *different*
measurements of its performance characteristics. When we then do clustering,
these several benchmarks of the same instruction Opcode may end up being
clustered into *different* clusters. This is not great for further analysis.

We shall find every Opcode with benchmarks not in just one cluster, and move
*all* the benchmarks of said Opcode into one new unstable cluster per Opcode.
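
As a rough illustration of the step described above (not the code from this patch), here is a minimal sketch. It assumes a simplified, hypothetical `Point` type that only records an opcode and the cluster id assigned by the initial clustering; `stabilize` is likewise a made-up name.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for a benchmark point.
struct Point {
  unsigned Opcode;  // opcode of the benchmarked instruction
  size_t ClusterId; // cluster assigned by the initial clustering
};

// Moves all points of every opcode that spans more than one cluster into one
// fresh cluster per such opcode. Returns the ids of the new unstable clusters.
std::set<size_t> stabilize(std::vector<Point> &Points, size_t NumClusters) {
  // Opcode -> set of clusters its benchmarks ended up in.
  std::map<unsigned, std::set<size_t>> OpcodeToClusters;
  for (const Point &P : Points)
    OpcodeToClusters[P.Opcode].insert(P.ClusterId);

  std::set<size_t> UnstableClusters;
  // std::map iterates in key order, so new ids are handed out the same way
  // on every run.
  for (const auto &Entry : OpcodeToClusters) {
    if (Entry.second.size() <= 1)
      continue; // All benchmarks of this opcode agree: stable.
    const size_t NewId = NumClusters++;
    UnstableClusters.insert(NewId);
    for (Point &P : Points)
      if (P.Opcode == Entry.first)
        P.ClusterId = NewId;
  }
  return UnstableClusters;
}
```

The ordered map is used here only to keep the assignment of new cluster ids reproducible between runs; the real implementation may achieve that differently.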

I have solved this by making ClusterId a bit field, adding an IsUnstable bit,
and introducing the -analysis-display-unstable-clusters switch to toggle between
displaying stable-only clusters and unstable-only clusters.
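
For readers unfamiliar with the idea, a cluster id that carries a stability flag in a bit field could look roughly like the sketch below. The member names, the `makeValid` factory and the `shouldDisplay` helper are illustrative assumptions, not taken from the patch.

```cpp
#include <cstddef>
#include <limits>

// Sketch of an id that packs a one-bit "unstable" flag next to the numeric id.
class ClusterId {
public:
  static ClusterId makeValid(size_t Id, bool IsUnstable = false) {
    return ClusterId(Id, IsUnstable);
  }
  size_t getId() const { return Id_; }
  bool isUnstable() const { return IsUnstable_; }

private:
  ClusterId(size_t Id, bool IsUnstable) : Id_(Id), IsUnstable_(IsUnstable) {}
  // All but one bit of a size_t hold the id; the remaining bit is the flag.
  size_t Id_ : (std::numeric_limits<size_t>::digits - 1);
  size_t IsUnstable_ : 1;
};

// Hypothetical filter mirroring the described switch: show stable-only
// clusters by default, unstable-only clusters when the switch is set.
bool shouldDisplay(const ClusterId &C, bool DisplayUnstableClusters) {
  return C.isUnstable() == DisplayUnstableClusters;
}
```

With this layout the flag costs one bit of id range rather than an extra member.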

The reclusterization is deterministic: it produces identical reports
between runs. (Or at least that is what I'm seeing; maybe it isn't.)

Timings/comparisons:
old (current trunk/head)

$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html' (25 runs):

           6624.73 msec task-clock                #    0.999 CPUs utilized            ( +-  0.53% )
               172      context-switches          #   25.965 M/sec                    ( +- 29.89% )
                 0      cpu-migrations            #    0.042 M/sec                    ( +- 56.54% )
             31073      page-faults               # 4690.754 M/sec                    ( +-  0.08% )
       26538711696      cycles                    # 4006230.292 GHz                   ( +-  0.53% )  (83.31%)
        2017496807      stalled-cycles-frontend   #    7.60% frontend cycles idle     ( +-  0.93% )  (83.32%)
       13403650062      stalled-cycles-backend    #   50.51% backend cycles idle      ( +-  0.33% )  (33.37%)
       19770706799      instructions              #    0.74  insn per cycle         
                                                  #    0.68  stalled cycles per insn  ( +-  0.04% )  (50.04%)
        4419821812      branches                  # 667207369.714 M/sec               ( +-  0.03% )  (66.69%)
         121741669      branch-misses             #    2.75% of all branches          ( +-  0.28% )  (83.34%)

            6.6283 +- 0.0358 seconds time elapsed  ( +-  0.54% )

patch, with reclustering but without filtering (i.e. outputting all the stable *and* unstable clusters)

$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-all.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-all.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-all.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-all.html' (25 runs):

           6475.29 msec task-clock                #    0.999 CPUs utilized            ( +-  0.31% )
               213      context-switches          #   32.952 M/sec                    ( +- 23.81% )
                 1      cpu-migrations            #    0.130 M/sec                    ( +- 43.84% )
             31287      page-faults               # 4832.057 M/sec                    ( +-  0.08% )
       25939086577      cycles                    # 4006160.279 GHz                   ( +-  0.31% )  (83.31%)
        1958812858      stalled-cycles-frontend   #    7.55% frontend cycles idle     ( +-  0.68% )  (83.32%)
       13218961512      stalled-cycles-backend    #   50.96% backend cycles idle      ( +-  0.29% )  (33.37%)
       19752995402      instructions              #    0.76  insn per cycle         
                                                  #    0.67  stalled cycles per insn  ( +-  0.04% )  (50.04%)
        4417079244      branches                  # 682195472.305 M/sec               ( +-  0.03% )  (66.70%)
         121510065      branch-misses             #    2.75% of all branches          ( +-  0.19% )  (83.34%)

            6.4832 +- 0.0229 seconds time elapsed  ( +-  0.35% )

Funnily, *this* measurement shows that said reclustering actually improved performance.

patch, with reclustering, only the stable clusters

$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-stable.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-stable.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-stable.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-stable.html' (25 runs):

           6387.71 msec task-clock                #    0.999 CPUs utilized            ( +-  0.13% )
               133      context-switches          #   20.792 M/sec                    ( +- 23.39% )
                 0      cpu-migrations            #    0.063 M/sec                    ( +- 61.24% )
             31318      page-faults               # 4903.256 M/sec                    ( +-  0.08% )
       25591984967      cycles                    # 4006786.266 GHz                   ( +-  0.13% )  (83.31%)
        1881234904      stalled-cycles-frontend   #    7.35% frontend cycles idle     ( +-  0.25% )  (83.33%)
       13209749965      stalled-cycles-backend    #   51.62% backend cycles idle      ( +-  0.16% )  (33.36%)
       19767554347      instructions              #    0.77  insn per cycle         
                                                  #    0.67  stalled cycles per insn  ( +-  0.04% )  (50.03%)
        4417480305      branches                  # 691618858.046 M/sec               ( +-  0.03% )  (66.68%)
         118676358      branch-misses             #    2.69% of all branches          ( +-  0.07% )  (83.33%)

            6.3954 +- 0.0118 seconds time elapsed  ( +-  0.18% )

Performance improved even further?! Makes sense I guess, fewer clusters to print.

patch, with reclustering, only the unstable clusters

$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-unstable.html -analysis-display-unstable-clusters
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-unstable.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-unstable.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-unstable.html -analysis-display-unstable-clusters' (25 runs):

           6124.96 msec task-clock                #    1.000 CPUs utilized            ( +-  0.20% )
               194      context-switches          #   31.709 M/sec                    ( +- 20.46% )
                 0      cpu-migrations            #    0.039 M/sec                    ( +- 49.77% )
             31413      page-faults               # 5129.261 M/sec                    ( +-  0.06% )
       24536794267      cycles                    # 4006425.858 GHz                   ( +-  0.19% )  (83.31%)
        1676085087      stalled-cycles-frontend   #    6.83% frontend cycles idle     ( +-  0.46% )  (83.32%)
       13035595603      stalled-cycles-backend    #   53.13% backend cycles idle      ( +-  0.16% )  (33.36%)
       18260877653      instructions              #    0.74  insn per cycle         
                                                  #    0.71  stalled cycles per insn  ( +-  0.05% )  (50.03%)
        4112411983      branches                  # 671484364.603 M/sec               ( +-  0.03% )  (66.68%)
         114066929      branch-misses             #    2.77% of all branches          ( +-  0.11% )  (83.32%)

            6.1278 +- 0.0121 seconds time elapsed  ( +-  0.20% )

This tells us that actually writing the -analysis-inconsistencies-output-file= output takes only ~0.4 sec for 43970 benchmark points (3 whole sweeps).
(Also, wow, this is fast; it used to take several minutes originally.)

Fixes PR40715.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Feb 18 2019, 8:04 AM
courbet added inline comments.Feb 19 2019, 4:49 AM
tools/llvm-exegesis/lib/Clustering.cpp
168 ↗(On Diff #187245)

"The list of opcodes that have more than one cluster".

172 ↗(On Diff #187245)

Why not const auto& ?

192 ↗(On Diff #187245)

Use for (const size_t UnstableOpcode : ...) for safety.

214 ↗(On Diff #187245)

at least, not at most.

216 ↗(On Diff #187245)

This for loop + CleanedPointIndices could be removed using std::remove_if:

// Find which points should be moved to the new cluster.
const auto it = std::remove_if(
    OldCluster.PointIndices.begin(), OldCluster.PointIndices.end(),
    [this, UnstableOpcode](size_t P) {
      return Points_[P].keyInstruction().getOpcode() == UnstableOpcode;
    });
// Move removed points to the new cluster:
UnstableCluster.PointIndices.insert(UnstableCluster.PointIndices.end(), it,
                                    OldCluster.PointIndices.end());
// Remove points from the old cluster.
OldCluster.PointIndices.erase(it, OldCluster.PointIndices.end());
tools/llvm-exegesis/llvm-exegesis.cpp
444 ↗(On Diff #187245)

nit: InstrInfo

445 ↗(On Diff #187245)

Why not: std::unique_ptr<llvm::MCInstrInfo> InstrInfo(TheTarget->createMCInstrInfo()); ?

lebedev.ri marked 8 inline comments as done.

Address most of @courbet's review notes.

tools/llvm-exegesis/lib/Clustering.cpp
172 ↗(On Diff #187245)

Hm, I see all three variants within the codebase.
I guess const auto& is better.

214 ↗(On Diff #187245)

Sure. I meant that any other assumption would be too optimistic,
and we would end up reallocating if we guessed wrong.
A guess of one will never be too optimistic, and won't cause reallocations.
At worst, we will over-allocate a bit.

216 ↗(On Diff #187245)

Hmm, are you sure?

http://cpp.sh/7dfh5
If that were so, then that ^ should have printed Textwithsomewhitespaces, but it prints Textwithsomewhitespacesespaces.

https://en.cppreference.com/w/cpp/algorithm/remove

"Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition)."

"Unspecified values" is pretty self-explanatory: the tail past the returned iterator cannot be used as the list of removed points.
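
To make the point concrete, here is a small self-contained sketch (not the code behind the cpp.sh link; `RemoveResult` and `removeSpaces` are made-up names). It only relies on what std::remove guarantees: the kept elements form the prefix up to the returned iterator, and the tail has the same length as the number of removed elements but unspecified contents.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// What std::remove actually guarantees about a string after the call.
struct RemoveResult {
  std::string Kept;  // [begin, new logical end): well-defined
  size_t TailLength; // number of leftover elements; their values are unspecified
};

RemoveResult removeSpaces(std::string S) {
  // Shifts the kept characters to the front and returns the new logical end.
  const auto NewEnd = std::remove(S.begin(), S.end(), ' ');
  RemoveResult R;
  R.Kept.assign(S.begin(), NewEnd);
  R.TailLength = static_cast<size_t>(S.end() - NewEnd);
  return R;
}
```

Calling `removeSpaces("Text with some whitespaces")` yields `Kept == "Textwithsomewhitespaces"` and `TailLength == 3`; nothing can be assumed about which characters sit in those three tail slots, so they are not "the removed elements".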

courbet added inline comments.Feb 19 2019, 7:10 AM
tools/llvm-exegesis/lib/Clustering.cpp
216 ↗(On Diff #187245)

Yes, sorry. std::stable_partition should work.
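
A minimal sketch of how std::stable_partition could do this job, assuming plain vectors of point indices and a hypothetical `OpcodeOf` lookup table in place of the real Points_ accessors (the committed code may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Moves every point index whose opcode equals UnstableOpcode from Old to the
// end of Unstable. OpcodeOf[P] is the opcode of point P (an assumption made
// for this sketch).
void movePoints(std::vector<size_t> &Old, std::vector<size_t> &Unstable,
                const std::vector<unsigned> &OpcodeOf,
                unsigned UnstableOpcode) {
  // Points to keep are placed first. Unlike std::remove_if, the second group
  // holds exactly the rejected elements, and both groups keep their order.
  const auto It = std::stable_partition(
      Old.begin(), Old.end(),
      [&](size_t P) { return OpcodeOf[P] != UnstableOpcode; });
  // Copy the points being moved, then drop them from the old cluster.
  Unstable.insert(Unstable.end(), It, Old.end());
  Old.erase(It, Old.end());
}
```

Because the relative order is preserved in both halves, the resulting clusters do not depend on implementation details of the standard library.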

lebedev.ri marked 4 inline comments as done.

Address @courbet's review notes.

tools/llvm-exegesis/lib/Clustering.cpp
216 ↗(On Diff #187245)

Aha. Not as simple as that snippet, but works.

lebedev.ri added inline comments.Feb 19 2019, 11:25 PM
tools/llvm-exegesis/lib/Clustering.cpp
216 ↗(On Diff #187245)

Thinking about this a bit more, do we really care that "Relative order of the elements is preserved."?
I don't think so. Only that the new order is deterministic.
Can we just use std::partition (https://en.cppreference.com/w/cpp/algorithm/partition) instead?

lebedev.ri marked 2 inline comments as done.Feb 20 2019, 12:19 AM
lebedev.ri added inline comments.
tools/llvm-exegesis/lib/Clustering.cpp
216 ↗(On Diff #187245)

Actually, hmm, I'm not confident std::partition *is* deterministic.
Never mind.

courbet accepted this revision.Feb 20 2019, 12:43 AM
This revision is now accepted and ready to land.Feb 20 2019, 12:43 AM
lebedev.ri marked an inline comment as done.Feb 20 2019, 12:47 AM

Yay, thank you for the review!

This revision was automatically updated to reflect the committed changes.