This is an archive of the discontinued LLVM Phabricator instance.

[PassManager] Run global opts after the inliner
ClosedPublic

Authored by davide on Sep 21 2017, 2:26 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

davide created this revision. Sep 21 2017, 2:26 PM

This is a proposed solution to PR34652.
We could in theory teach the inliner some other tricks, but this seems harmless and a good thing to do anyway.
The Rust folks ran quite a few benchmarks (and saw up to a 20% reduction in build time); I ran some internal code and saw up to 5% reductions, with no significant regressions (anything else was within the noise).
cc: @mzolotukhin @Gerolf as they expressed interest in the area.
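
For readers without the diff handy, the change boils down to roughly the following in the legacy PassManagerBuilder (a sketch rather than the exact diff; the exact placement in the pipeline is discussed below):

if (RunInliner) {
  // Re-optimize globals exposed by inlining, and delete functions that
  // inlining has made dead, before the later (expensive) passes see them.
  MPM.add(createGlobalOptimizerPass());
  MPM.add(createGlobalDCEPass());
}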

chandlerc edited edge metadata. Sep 21 2017, 2:28 PM

Could you include analogous updates to the new PM's pipeline? (I can do it myself in a follow-up if you'd rather...)

Indeed. I'm going to update the patch in a sec.

efriedma added inline comments.
lib/Transforms/IPO/PassManagerBuilder.cpp
513 ↗(On Diff #116266)

Are you sure this is where you meant to put these passes...? GlobalOpt and GlobalDCE are module passes, so you're separating the inliner from the function simplification passes.

I didn't actually notice until now that a bunch of tests are failing (I tested on several codebases, but not on LLVM itself :)
So I'm going to audit all of them and see whether the transform makes sense.

********************
Testing Time: 44.51s
********************
Failing Tests (17):
    LLVM :: CodeGen/AMDGPU/early-inline.ll
    LLVM :: Other/pass-pipelines.ll
    LLVM :: ThinLTO/X86/diagnostic-handler-remarks-with-hotness.ll
    LLVM :: ThinLTO/X86/diagnostic-handler-remarks.ll
    LLVM :: Transforms/Coroutines/ArgAddr.ll
    LLVM :: Transforms/Coroutines/coro-split-01.ll
    LLVM :: Transforms/Coroutines/ex0.ll
    LLVM :: Transforms/Coroutines/ex1.ll
    LLVM :: Transforms/Coroutines/ex2.ll
    LLVM :: Transforms/Coroutines/ex3.ll
    LLVM :: Transforms/Coroutines/ex4.ll
    LLVM :: Transforms/Coroutines/ex5.ll
    LLVM :: Transforms/Coroutines/no-suspend.ll
    LLVM :: Transforms/Coroutines/phi-coro-end.ll
    LLVM :: Transforms/Inline/devirtualize.ll
    LLVM :: Transforms/Inline/inline_minisize.ll
    LLVM :: Transforms/PhaseOrdering/globalaa-retained.ll
lib/Transforms/IPO/PassManagerBuilder.cpp
513 ↗(On Diff #116266)

No, you're right, they should be outside the if (actually, that's where I put them in the first place, then I wrongly moved them).


The reason why all these tests are failing is that, rather unfortunately, they specify -O2 on the command line to test specific functionality, instead of individual passes (or sets of passes).

GorNishanov added inline comments.
lib/Transforms/IPO/PassManagerBuilder.cpp
513 ↗(On Diff #116266)

Should those be MPM.add as opposed to PM.add?

davide added a comment. Oct 1 2017, 9:05 PM

Yes, I'll upload a new version of the patch early tomorrow.

mehdi_amini added inline comments. Oct 1 2017, 10:29 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
511 ↗(On Diff #116266)

Did you run any benchmarks to conclude that it is faster? (The referenced PR seems to only mention some specific IR generated by Rust IIUC)

chandlerc added inline comments. Oct 1 2017, 10:45 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
511 ↗(On Diff #116266)

I chatted some with Davide about this on IRC and he had seen similar issues with C++. Fundamentally, I would expect the same pattern to be able to come up in C++.

mehdi_amini added inline comments. Oct 1 2017, 10:48 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
511 ↗(On Diff #116266)

OK, it'd be nice to have a test-case with this patch though.

Basically, thanks Chandler.
I'm more than happy to make this an RFT (although I'm not holding my breath :)) and wait a bit before committing.
FWIW, I expect this to be fairly cheap nonetheless (in fact, the full LTO pipeline already does it, and it never seemed to have a huge impact; at least I haven't seen it showing up a lot in a profile). I'll upload the new patch tomorrow and try to run some tests (sorry, this one doesn't even compile, as I retyped it instead of copying it from the machine where I ran the tests).

davide added inline comments. Oct 1 2017, 10:51 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
511 ↗(On Diff #116266)

Testing this feature is... a little hard with the current infrastructure.
I could write an -O2 test, if you'd like, but it's unclear how robust it would be (in particular, if something earlier in the pipeline changes).

mehdi_amini added inline comments. Oct 1 2017, 10:58 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
511 ↗(On Diff #116266)

I've always felt we should have more integration tests (i.e. checking that code is, for instance, correctly vectorized when run through -O2, etc.). Not as a lame replacement for targeted opt tests, but because end-to-end testing seems important.
I think Chandler told me that his view was that this sort of thing should be caught by benchmarking, in which case we should observe an effect on the compile-time LNT suite for this change (or it would possibly indicate a missing test-case).

davide added inline comments.
lib/Transforms/IPO/PassManagerBuilder.cpp
511 ↗(On Diff #116266)

At some point @dberlin told me about the GCC infrastructure for end-to-end pipeline testing (and how it helped him find a series of bugs in the past), so I ended up taking a look.
It allows you to break at arbitrary points in the pipeline and make sure you actually do the right thing in the next pass, so it's far more robust in case passes are introduced before/after the one you want to test.
In theory, -opt-bisect-limit=X kinda approximates this behaviour, but it's more a helper for debugging than a helper for testing.

Maybe lit could grow a set of directives, e.g. --check-stop-before=PASSNAME, so that we could check the IR right before and right after a given pass has run on a given function. It of course gets a little more complicated when the unit of IR you're working on is no longer a function but, e.g., an SCC; still, having the function pass pipeline better tested end-to-end would already be an enormous improvement over the status quo.

davide added a comment. Oct 2 2017, 4:11 PM

Ran on clang itself:

Unpatched:

real    11m39.262s
user    264m10.512s
sys     6m10.221s

Patched:

real    11m38.739s
user    264m20.851s
sys     6m8.897s

No significant difference.

davide updated this revision to Diff 117443. Oct 2 2017, 4:19 PM

Put the passes in the correct place in the pipeline (after the CGSCC passes, so all the failures are gone).
Updated all the tests.

davide added a comment. Oct 2 2017, 4:20 PM

The Rust folks also re-ran with the new patch and confirmed the gain.
https://github.com/rust-lang/llvm/pull/92#issuecomment-333674662

I ran the same benchmark I have internally and saw the same benefit. For clang, lld, and llvm-tblgen, this change is neutral.
PTAL :)

davide added a comment. Oct 2 2017, 4:20 PM

FWIW, I'll handle the same change for the new PM in a follow-up.

I'm generally fine with this going in given the measurements.

Mehdi, would you feel comfortable with this as-is? I think even if we get more integration-style testing it will be hard to see this particular change, because it should only materialize as compile-time savings, and we may just not have any useful IR test cases in the test suite that would show that.

I think the closest thing to a way to test this change would be to take some pattern that is *really* bad without this and include it in the test suite (much like we have other IR test cases in the test suite from other frontends). But I'm not sure it makes sense to block this on landing that. I generally think that is something we should encourage the Rust community to work on...

lib/Transforms/IPO/PassManagerBuilder.cpp
528 ↗(On Diff #117443)

We shouldn't include PR links in the source like this. Whatever context is needed should be summarized (and I think it already is).

Mehdi, would you feel comfortable with this as-is?

Sure.

I put a suggestion inline though, because of interaction with -flto=thin

lib/Transforms/IPO/PassManagerBuilder.cpp
532 ↗(On Diff #117443)

What about moving this a little bit below and replacing:

// If we are planning to perform ThinLTO later, let's not bloat the code with
// unrolling/vectorization/... now. We'll first run the inliner + CGSCC passes
// during ThinLTO and perform the rest of the optimizations afterward.
if (PrepareForThinLTO) {
  // Reduce the size of the IR as much as possible.
  MPM.add(createGlobalOptimizerPass());
  // Rename anon globals to be able to export them in the summary.
  MPM.add(createNameAnonGlobalPass());
  return;
}

with:

if (RunInliner) {
  MPM.add(createGlobalOptimizerPass());
  MPM.add(createGlobalDCEPass());
}

// If we are planning to perform ThinLTO later, let's not bloat the code with
// unrolling/vectorization/... now. We'll first run the inliner + CGSCC passes
// during ThinLTO and perform the rest of the optimizations afterward.
if (PrepareForThinLTO) {
  // Rename anon globals to be able to export them in the summary.
  MPM.add(createNameAnonGlobalPass());
  return;
}

to avoid running the GlobalOptimizer twice.

Hi Davide,

Could you please test it on CTMark as well (it's a part of the LLVM test-suite and usually takes ~5 minutes to compile)? I wonder if we see improvements there.

Thanks,
Michael

chandlerc added inline comments. Oct 2 2017, 6:02 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
532 ↗(On Diff #117443)

+1

davide added a comment. Oct 2 2017, 6:15 PM

Hi Davide,

Could you please test it on CTMark as well (it's a part of the LLVM test-suite and usually takes ~5 minutes to compile)? I wonder if we see improvements there.

Thanks,
Michael

Sure, Michael.
Is there a single-line cmake invocation I can use to get the results and grab the numbers?
I don't necessarily fancy the idea of setting up an LNT instance, but if I have to, I will.

davide added a comment. Oct 2 2017, 6:16 PM

Mehdi, would you feel comfortable with this as-is?

Sure.

I put a suggestion inline though, because of interaction with -flto=thin

Makes sense. Thanks Mehdi & Chandler for the reviews!

I'd recommend using LNT, here is a script with full ground-up setup for your convenience:

cd tmp

# Setup virtual environment
/usr/local/bin/virtualenv venv
. venv/bin/activate

# Checkout test-suite and LNT
for D in test-suite lnt
  do
    if [ ! -d "$D" ]; then
      git clone http://llvm.org/git/$D
    fi
  done

# Install LNT
pip install -r lnt/requirements.client.txt
python lnt/setup.py develop

# Install LIT
pip install svn+http://llvm.org/svn/llvm-project/llvm/trunk/utils/lit/

# Set some variables
SANDBOX=/Path/To/Where/LNT/Will/Run
COMPILER=/Path/To/Your/Compiler
TESTSUITE=$PWD/test-suite       # We should've checked it out here
OPTSET=ReleaseLTO               # Or Os, or O0-g, or ReleaseThinLTO (see test-suite/cmake/caches for other options)

# Create sandbox
mkdir $SANDBOX
cd $SANDBOX

# Fill in LNT flags
LNT_FLAGS=" --sandbox $SANDBOX"
LNT_FLAGS+=" --no-timestamp"
LNT_FLAGS+=" --use-lit=lit"
LNT_FLAGS+=" --cc $COMPILER/bin/clang"
LNT_FLAGS+=" --cxx $COMPILER/bin/clang++"
LNT_FLAGS+=" --test-suite=$TESTSUITE"
LNT_FLAGS+=" --cmake-define TEST_SUITE_BENCHMARKING_ONLY=On"
LNT_FLAGS+=" --no-auto-name '${COMPILER%.*}'"
LNT_FLAGS+=" --output \"$SANDBOX/report.json\""
LNT_FLAGS+=" -C target-arm64-iphoneos -C $OPTSET"

LNT_FLAGS+=" --cmake-define TEST_SUITE_RUN_BENCHMARKS=Off"
LNT_FLAGS+=" --build-threads 1"
LNT_FLAGS+=" --cmake-define TEST_SUITE_SUBDIRS=\"CTMark\""

# Run LNT
lnt runtest test-suite ${LNT_FLAGS}

This will test the given compiler and generate a report in report.json in the sandbox folder with the results. There is a test-suite/utils/compare.py script for comparing such reports, though it has additional Python package requirements, so it might be a bit cumbersome to use.

Michael

I'd recommend using LNT,

This seems like a really cumbersome process to expect folks to go through for changes intending to improve compile time. Is there no way to simplify it?

For example, we can easily get test suite results just using llvm-lit directly with a build of the test-suite. Is there anything more comparable to this in terms of overhead for developers?

(Also, your instructions mention target-arm64-iphoneos, which I suspect needs to be adjusted for most folks?)

This will test the given compiler and generate a report in report.json in the sandbox folder with the results. There is a test-suite/utils/compare.py script for comparing such reports, though it has additional Python package requirements, so it might be a bit cumbersome to use.

Heh, I don't find compare.py cumbersome; doing the full LNT setup seems much more so... Installing packages is relatively easy, it's the steps and deep layers that seem somewhat burdensome (IMO). I've really liked the recent direction of the test-suite: running with llvm-lit and nothing else, really.

I'd recommend using LNT,

This seems like a really cumbersome process to expect folks to go through for changes intending to improve compile time. Is there no way to simplify it?

Which part is cumbersome? I used to do this all the time, and it is actually pretty nice once you get used to it. In particular, it can build once and run each test multiple times to get more samples and reduce the measurement noise. The comparison script is easy to use as well.

Note, in case there is any confusion: this does not require setting up or using an LNT *instance* (as in the web application).

On the other hand, I never found any similar convenience with the cmake/lit use of the test-suite (maybe it's too young to have proper support?).

I'd recommend using LNT,

This seems like a really cumbersome process to expect folks to go through for changes intending to improve compile time. Is there no way to simplify it?

Which part is cumbersome?

I already have test-suite checked out and correctly hooked up to SPEC and other things. LNT is a large Python project including server code and other things. It all runs in a Python virtualenv which I don't know anything about and don't use for any other part of development.

I used to do this all the time, and it is actually pretty nice once you get used to it. In particular, it can build once and run each test multiple times to get more samples and reduce the measurement noise.

Can definitely get multiple runs with lit directly, no need for virtualenv or LNT... In fact, AFAIK, LNT just uses lit to do all of this already. LNT seems really focused on running a server and collecting and aggregating results over time, rather than checking a single commit.

The comparison script is easy to use as well.

Yeah, I have used the comparison script myself with lit directly, seems quite nice.

Note, in case there is any confusion: this does not require to setup or use an LNT *instance* (as in the web application).

Yes, but you're still paying for a lot of this infrastructure.

On the other hand I never found any similar convenience with the cmake/lit use of the test-suite (maybe too young to have proper support?).

Possibly, dunno when this support landed.

I'll give it a whirl tomorrow.
FWIW, I share Chandler's concerns about the cost of running LNT vs. having something hooked into the build system.
I think it's really valuable (and I've used it in the past) but it's a little too much for everyday use [at least for my workflow].
I use it once every 2-3 months, then I leave it aside and forget how to set it up (so I have to read the docs again, or ask around, as I did this time :)

That said:
IIRC at some point Sean and I were able to just type make and get a JSON file spit out (I'll try that again tomorrow).
The only feature missing was a lit-like set of checks to make sure your changes didn't screw up compile time,
e.g.

ninja check-perf-baseline   # saves the baseline somewhere
git apply myuberpatch.diff
ninja clang
ninja check-perf            # compares the new run with the baseline and errors out if size/compile time blows up [you can imagine the tolerance threshold as an input]
ninja wipe-perf-cache       # wipes out the baseline

Maybe I'm the only one thinking this is a better workflow for everyday development, but the rationale behind my claim is that this almost mirrors what people do before committing a change anyway, so running two extra commands on top shouldn't be a big deal.

Which part is cumbersome?

I was talking about compare.py specifically. Namely, it uses the pandas package, which is not as commonly installed as others, and installing it wasn't very easy/fast, if I remember correctly. All I wanted to say is that these tools require some getting used to and might not work on the first try, and in the case of compare.py you might not want to use it if you only need to do this once.

Can definitely get multiple runs with lit directly, no need for virtualenv or LNT... In fact, AFAIK, LNT just uses lit to do all of this already. LNT seems really focused on running a server and collecting and aggregating results over time, rather than checking a single commit.

This is true; one can run the same commands with cmake + make/ninja, and that'll give absolutely the same results. lnt runtest is just a wrapper for this.

On the other hand I never found any similar convenience with the cmake/lit use of the test-suite (maybe too young to have proper support?).

That actually surprised me. It should be pretty identical (at least now). Have you tried it recently?

The only feature missing was a lit-like set of checks to make sure your changes didn't screw up compile time.

That's an interesting idea. I'm actually going to give a small refresher on how to collect compile-time results at LLVM Dev; we can discuss what can be improved in that area there.

Michael

I uploaded the results before and after. There are some performance wins in terms of size, which is good (this is at -Os).
The only concerning regression is 7zip, which gets around 1.8% slower (compile-time wise) in exchange for some size win (a reduction in .text and .eh_frame/.eh_frame_hdr).
I ran that several times and saw this value oscillating between 0.8% and 1.8%, so there's some noise to take into account.
Overall, I think this change didn't significantly regress anywhere, so I'd vote for going forward (I'll also send a heads-up to llvm-dev, just in case).
Thoughts?

Thanks!

Hm, from the files you uploaded I don't see a slowdown on 7zip, but I see some others. Are we talking about the same numbers? :)

Metric: compile_time
Program                                         unpatched  patched diff

lencod/lencod                                  12.42      12.71    2.3%
kimwitu++/kc                                   13.47      13.76    2.1%
Bullet/bullet                                  28.41      28.99    2.0%
sqlite3/sqlite3                                 7.75       7.88    1.7%
consumer-typeset/consumer-typeset              11.19      11.37    1.6%
mafft/pairlocalalign                            7.16       7.25    1.3%
SPASS/SPASS                                    14.27      14.43    1.1%
ClamAV/clamscan                                12.96      13.06    0.8%
tramp3d-v4/tramp3d-v4                          18.24      18.19   -0.3%
7zip/7zip-benchmark                            34.97      34.96   -0.0%


Metric: size
Program                                         unpatched  patched diff

mafft/pairlocalalign                           355304     353552  -0.5%
tramp3d-v4/tramp3d-v4                          954304     950064  -0.4%
lencod/lencod                                  646000     644472  -0.2%
Bullet/bullet                                  1752656    1751688 -0.1%
7zip/7zip-benchmark                            1224296    1224336  0.0%
kimwitu++/kc                                   1157536    1157504 -0.0%
ClamAV/clamscan                                658536     658536   0.0%
SPASS/SPASS                                    566744     566744   0.0%
consumer-typeset/consumer-typeset              544184     544184   0.0%
sqlite3/sqlite3                                402200     402200   0.0%

BTW, is it possible to pull a testcase from the Rust codebase for CTMark? I think it's good to add more real-world examples there.

Michael

davide added a comment. Oct 4 2017, 2:47 PM

Hm, from the files you uploaded I don't see a slowdown on 7zip, but I see some others. Are we talking about the same numbers? :)


Sorry, I meant sqlite3/sqlite3.
How do you get the diff, out of curiosity? I eyeballed the differences (and missed something, of course :))
BTW, yes, we can probably work out adding some Rust tests to CTMark; I'll ask Alex Crichton.

How do you get the diff out of curiosity?

It's produced by the compare.py script:

python test-suite/utils/compare.py -m compile_time unpatched.json patched.json
python test-suite/utils/compare.py -m size unpatched.json patched.json

But to run it, you will first need to pip install pandas, which can take up to 10 minutes :)

As for the results: even though I'm not super excited about the (mostly) slowdowns, the gains on the real project probably outweigh them, so I think it's the right thing to do.

One follow-up thought: how hard would it be to run GlobalDCE only on functions? AFAIU, the gain is expected to come from removing functions that have been inlined and thus no longer need to be compiled. I wonder if we can avoid the slowdowns by looking only for dead functions when we call GlobalDCE after inlining.
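
A rough sketch (hypothetical, against the legacy pass API) of what such a functions-only variant could look like; being purely use-count based, it would miss cycles of dead functions that call each other, which GlobalDCE's mark-and-sweep handles:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"
using namespace llvm;

namespace {
// Hypothetical pass: after inlining, drop internal functions that have no
// remaining uses, without scanning global variables and aliases the way
// GlobalDCE does.
struct DeadFunctionOnlyElim : public ModulePass {
  static char ID;
  DeadFunctionOnlyElim() : ModulePass(ID) {}
  bool runOnModule(Module &M) override {
    SmallVector<Function *, 16> Dead;
    for (Function &F : M)
      if (!F.isDeclaration() && F.hasLocalLinkage() && F.use_empty())
        Dead.push_back(&F);
    for (Function *F : Dead) // collected first, so iteration isn't invalidated
      F->eraseFromParent();
    return !Dead.empty();
  }
};
} // namespace

char DeadFunctionOnlyElim::ID = 0;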

Michael

davide added a comment. Oct 4 2017, 5:56 PM

Here's the link to the original PR, as Michael asked about it:
https://github.com/rust-lang/llvm/pull/92

What levels of optimization are used in the tests?
I have only tested the Rust projects with -O3 and -O0, since those are the defaults for release/debug builds in Rust.
Maybe a check of the level is needed.

The tests have been run at -Os.

andjo403 added a comment. Edited Oct 4 2017, 11:34 PM

I did a small test in Rust with -Os and the compile time does not change with the patch, so a check for OptLevel > 2 might be good to have.

I did a small test in Rust with -Os and the compile time does not change with the patch, so a check for OptLevel > 2 might be good to have.

Note that "no change" is really quite expected in many cases... I don't think that makes this change inappropriate.

I'm somewhat confused by people's concerns here. This change seems *really* safe:

  1. The passes added are among the fastest passes in LLVM.
  2. They cover a clear, demonstrable weakness of the rest of the pipeline, deleting dead IR earlier and avoiding expensive work on that IR.

It's of course good to check this stuff, but honestly this isn't the change I would be worried about for compile time...
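
To make (2) concrete, here is a hypothetical, simplified C++ reproducer of the pattern (the real-world cases, e.g. the Rust IR in the referenced PR, are messier):

// The address of 'impl' escapes into an internal constant, so the inliner
// alone can't trivially prove it dead. Once GlobalOpt folds the indirect
// call through 'fp' into a direct one and it gets inlined, GlobalDCE can
// delete the dead body before expensive per-function passes churn on it.
static int impl(int x) {
  int sum = 0;
  for (int i = 0; i < x; ++i) // the kind of loop later passes would chew on
    sum += i * x;
  return sum;
}
static int (*const fp)(int) = &impl;
int call(int x) { return fp(x); }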

davide added a comment. Oct 5 2017, 9:51 AM

I would've checked this in a while ago :) Anyway, I think it was good to run some tests, but this is clearly a good thing to do.
I'm going ahead and committing this, as it seems we've built enough consensus around it. In the worst case, we'll just revert it :)

I would've checked this in a while ago :) Anyway, I think it was good to run some tests, but this is clearly a good thing to do.
I'm going ahead and committing this, as it seems we've built enough consensus around it. In the worst case, we'll just revert it :)

SGTM, and yeah, we can tweak afterward easily.

This revision was automatically updated to reflect the committed changes.

When this was merged into Rust I noticed that it did not give any gain.
I think I did something wrong when I tested your modification of my original patch.
As I understand it, this location is too late, because most of the time is spent "in" addFunctionSimplificationPasses.

What was the problem with having the GlobalDCE pass directly after the inliner?

When this was merged into Rust I noticed that it did not give any gain.
I think I did something wrong when I tested your modification of my original patch.
As I understand it, this location is too late, because most of the time is spent "in" addFunctionSimplificationPasses.

What was the problem with having the GlobalDCE pass directly after the inliner?

That breaks the entire pipelining model of the LLVM passes and inliner. =/

The only thing I know of that would help at that point would be to build a CGSCC version of GlobalDCE. This is something we've wanted for quite some time, but it is a fair amount of work. And it will also be substantially more interesting in the new pass manager's model of the call graph.

@chandlerc since we saw such a drastic decrease in compile times with the original version of this patch, could you elaborate on the downsides of doing that in our local fork of LLVM? Or would it still be unacceptable to send this patch upstream?

The downside of the original version of the patch is most likely worse codesize and runtime performance. If the inliner isn't interleaved correctly, the heuristics become less accurate; the inliner depends on the simplification passes to get rid of dead and redundant code.

Ok, thanks for the info!

MatzeB added a subscriber: MatzeB. Edited Oct 10 2017, 3:07 PM

For the record, measuring CTMark can be as easy as this without any LNT involved:

git clone http://llvm.org/git/test-suite.git
mkdir test-suite-build
cd test-suite-build
cmake -DCMAKE_C_COMPILER=/path/to/my/clang -GNinja -DTEST_SUITE_SUBDIRS=CTMark -DTEST_SUITE_RUN_BENCHMARKS=Off ../test-suite
ninja
/path/to/my/llvm/builddir/bin/llvm-lit -o result.json .

Repeat for multiple compilers/patches (or with different -DCMAKE_C_FLAGS etc.) and in the end use compare.py to compare the results of multiple runs.

Some more points on benchmarking (I really should write more documentation rather than commenting here...):

  • Use ninja -j1 to get less noisy compile-time measurements (there are of course a lot of other things you can and should do to reduce the noise level on your system; search for appropriate llvm-dev mailing list posts, etc.)
  • On macOS you may need -C ../test-suite/cmake/caches/target-x86_64-macosx.cmake to pick up the right include directories from xcode/xcrun.
  • As with any cmake project, I find ccmake . really helpful to discover possible options and flags.
  • To reduce noise, run the process multiple times and let compare.py take the minimum: compare.py base_run0.json base_run1.json base_run2.json vs patched_run0.json patched_run1.json patched_run2.json (Note: the vs is an actual command-line argument)

So it is possible that the large improvement in compile time is because we broke the pipeline, and because of that less optimization was possible/executed.

So it is possible that the large improvement in compile time is because we broke the pipeline, and because of that less optimization was possible/executed.

Unsure, but you can verify your theory by running the code with -time-passes passed to opt and comparing the times (please note that it doesn't really take analyses into account in some cases [or attributes them to other passes], but it should be good enough to start).

For the record, measuring CTMark can be as easy as this without any LNT involved.

Oh, thanks Matthias, that's very useful.

So it is possible that the large improvement in compile time is because we broke the pipeline, and because of that less optimization was possible/executed.

That's one possibility. There is also quite likely an issue Rust is hitting that the later positioning of these passes doesn't help with. It's hard or impossible to tell which is dominant here...

The only thing I'm confident in is what Eli said: this positioning *does* cripple the intended design of the core LLVM optimization pipeline and will severely hurt inlining and optimization quality on some code patterns. (Others, of course, will be largely unimpacted.)