This is an archive of the discontinued LLVM Phabricator instance.

utils/compare.py: Fix support for multiple metrics
ClosedPublic

Authored by arichardson on Mar 17 2022, 5:39 AM.

Details

Summary

In addition, this retains the pandas multi-index to improve the
visualization. We also transform the program column in-place now since
it appears that DataFrame.to_string() ignores the formatter argument for
all columns that have dtype=object.

Before: utils/compare.py a.json vs b.json --lhs-name=baseline --rhs-name=new -m compile_time -m link_time --nodiff

Program                                        baseline  new    baseline  new
-C++/Shootout-C++-lists                         0.00      0.00   0.09      0.08
-C++/Shootout-C++-objinst                       0.00      0.00   0.07      0.08
...
         baseline         new    baseline         new
count  315.000000  315.000000  315.000000  315.000000
mean   3.543040    3.531691    0.076500    0.078680
...

As can be seen here, the metric is missing from the column headers, so it's not
clear which one is which if you use multiple metrics.

With this change we now see the metrics, and they are sorted by largest diff:

Program                                       compile_time        link_time
                                              baseline     new    baseline  new
SingleSource/Benchmarks/Stanford/Perm           0.07         0.08   0.08      0.11
SingleSour...Benchmarks/Stanford/Bubblesort     0.07         0.08   0.10      0.10
...
      compile_time               link_time
l/r       baseline         new    baseline         new
count  315.000000   315.000000  315.000000  315.000000
mean   3.543040     3.531691    0.076500    0.078680
...

More importantly (my original motivation for this change), the script currently
fails to run when passing multiple metrics without --nodiff. For example,
utils/compare.py a.json vs b.json -m compile_time -m link_time result in:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I believe this used to work and regressed in 0ee69b891c3b167474212eb996b78c23f82bc13d
(https://reviews.llvm.org/D57828) as the exception is raised in add_geomean_row()

After:

Program                                       compile_time              link_time
                                              baseline     new    diff  baseline  new    diff
SingleSource/Benchmarks/Stanford/Perm           0.07         0.08 26.2%   0.08      0.11  34.4%
SingleSour...Benchmarks/Stanford/Bubblesort     0.07         0.08 16.9%   0.10      0.10   4.0%
...
                           Geomean difference                     -0.4%                    2.9%
      compile_time                           link_time
l/r       baseline         new        diff    baseline         new        diff
count  315.000000   315.000000  283.000000  315.000000  315.000000  315.000000
mean   3.543040     3.531691   -0.002523    0.076500    0.078680    0.041652
...

Listing multiple metrics with a single file was also broken:
utils/compare.py a.json -m compile_time -m link_time results an exception
while executing d['$sortkey'] = d[sortkey].
ValueError: Expected a 1D array, got an array with shape (315, 2).
This problem has most likely existed since the initial commit that added this
script, but with this change we can now handle mutiple metrics using the newer
pandas d.sort_values(by=(metrics[0], sortkey), ascending=False) instead of
adding a new column and sorting on that. We now get a table sorted by the
first metric:

Program                                       compile_time link_time
                                              results      results
ultiSource...Benchmarks/7zip/7zip-benchmark   109.56         0.22
ultiSource/Benchmarks/Bullet/bullet            73.55         0.17
...
      compile_time   link_time
run        results     results
count  315.000000   315.000000
mean   3.543040     0.076500
...

Event Timeline

arichardson created this revision.Mar 17 2022, 5:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2022, 5:39 AM
arichardson requested review of this revision.Mar 17 2022, 5:39 AM

Ensure the full terminal width is used for the d.describe() call.

MatzeB accepted this revision.Apr 20 2022, 10:13 AM

Changes make sense to me, and a quick sanity check of the script seemed fine.

Thanks, LGTM

utils/compare.py
217

Seems like you fixed the TODO and can remove it?

This revision is now accepted and ready to land.Apr 20 2022, 10:13 AM
This revision was automatically updated to reflect the committed changes.
arichardson marked an inline comment as done.