Details
This is a proposed solution to PR34652.
We could in theory teach the inliner some other tricks, but this seems harmless and beneficial anyway.
The Rust folks ran quite a few benchmarks (and saw up to a 20% reduction in build time). I ran some internal code and saw up to 5% reductions, with no significant regressions (within the noise).
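For context: the shape of the change, spelled out verbatim in an inline suggestion further down, is roughly the sketch below; most of the review is about where exactly in PassManagerBuilder::populateModulePassManager it should go.

```cpp
// Sketch of the proposal (see the inline suggestion below for the exact
// placement): once the inliner-driven CGSCC pipeline has finished, delete
// the IR that inlining left dead, before the later, more expensive passes
// walk over it.
if (RunInliner) {
  MPM.add(createGlobalOptimizerPass()); // clean up globals left behind
  MPM.add(createGlobalDCEPass());       // drop functions made dead by inlining
}
```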
cc: @mzolotukhin @Gerolf as they expressed interest in the area.
Could you include analogous updates to the new PM's pipeline? (I can do it myself in a follow-up if you'd rather...)
lib/Transforms/IPO/PassManagerBuilder.cpp:513
Are you sure this is where you meant to put these passes...? GlobalOpt and GlobalDCE are module passes, so you're separating the inliner from the function simplification passes.
I didn't actually notice until now that a bunch of tests are failing (I tested on several codebases, but not LLVM itself :)
So I'm going to audit all of them and see whether the Xform makes sense.
```
********************
Testing Time: 44.51s
********************
Failing Tests (17):
    LLVM :: CodeGen/AMDGPU/early-inline.ll
    LLVM :: Other/pass-pipelines.ll
    LLVM :: ThinLTO/X86/diagnostic-handler-remarks-with-hotness.ll
    LLVM :: ThinLTO/X86/diagnostic-handler-remarks.ll
    LLVM :: Transforms/Coroutines/ArgAddr.ll
    LLVM :: Transforms/Coroutines/coro-split-01.ll
    LLVM :: Transforms/Coroutines/ex0.ll
    LLVM :: Transforms/Coroutines/ex1.ll
    LLVM :: Transforms/Coroutines/ex2.ll
    LLVM :: Transforms/Coroutines/ex3.ll
    LLVM :: Transforms/Coroutines/ex4.ll
    LLVM :: Transforms/Coroutines/ex5.ll
    LLVM :: Transforms/Coroutines/no-suspend.ll
    LLVM :: Transforms/Coroutines/phi-coro-end.ll
    LLVM :: Transforms/Inline/devirtualize.ll
    LLVM :: Transforms/Inline/inline_minisize.ll
    LLVM :: Transforms/PhaseOrdering/globalaa-retained.ll
```
lib/Transforms/IPO/PassManagerBuilder.cpp:513
No, you're right, they should be outside the if (actually, that's where I put them in the first place, then wrongly moved them).
The reason all these tests are failing is that, rather unfortunately, they specify -O2 on the command line to test specific functionality, instead of running an individual pass (or set of passes).
lib/Transforms/IPO/PassManagerBuilder.cpp:513
Should those be MPM.add as opposed to PM.add?
lib/Transforms/IPO/PassManagerBuilder.cpp:511
Did you run any benchmarks to conclude that it is faster? (The referenced PR seems to only mention some specific IR generated by Rust, IIUC.)
lib/Transforms/IPO/PassManagerBuilder.cpp:511
I chatted some with Davide about this on IRC and he had seen similar issues with C++. Fundamentally, I would expect the same pattern to be able to come up in C++.
lib/Transforms/IPO/PassManagerBuilder.cpp:511
OK, it'd be nice to have a test case with this patch though.
Basically, thanks Chandler.
I'm more than happy to make this an RFT (although I'm not holding my breath :)) and wait a bit before committing.
FWIW, I expect this to be fairly reasonable to do nonetheless (in fact, the full LTO pipeline does it, and it never seemed to have a huge impact; at least I haven't seen it showing up a lot in a profile). I'll update the patch tomorrow and try to run some tests (sorry, this one doesn't even compile, as I rewrote it instead of copying from the machine where I ran the tests).
lib/Transforms/IPO/PassManagerBuilder.cpp:511
Testing this feature is ... a little hard with the current infrastructure.
lib/Transforms/IPO/PassManagerBuilder.cpp:511
I've always felt we should have more integration tests (i.e., checking that code is, for instance, correctly vectorized when run through O2, etc.). Not as a lame replacement for targeted opt tests, but because end-to-end coverage seems important.
lib/Transforms/IPO/PassManagerBuilder.cpp:511
At some point @dberlin told me about the GCC infrastructure for end-to-end pipeline testing (and how it helped him find a series of bugs in the past), so I ended up taking a look. Maybe lit could grow a set of directives, e.g. --check-stop-before=PASSNAME, so that we can check the IR right before and right after a given pass has run on a given function. It of course gets a little more complicated when the unit of IR you're working on is no longer a function but, e.g., an SCC; still, having the function pass pipeline better tested end-to-end would already be an enormous improvement over the status quo.
Ran on clang itself:
```
Unpatched:
real    11m39.262s
user    264m10.512s
sys     6m10.221s

Patched:
real    11m38.739s
user    264m20.851s
sys     6m8.897s
```
No significant difference.
Put the pass in the correct place in the pipeline (after the CGSCC passes, so all the failures are gone).
Update all the tests.
The Rust folks also re-ran with the new patch and confirmed the gain.
https://github.com/rust-lang/llvm/pull/92#issuecomment-333674662
I ran the same benchmark I have internally and saw the same benefit. For clang, lld, and llvm-tblgen, this change is neutral.
PTAL :)
I'm generally fine with this going in given the measurements.
Mehdi, would you feel comfortable with this as-is? I think even if we get more integration-style testing it will be hard to see this particular change because it should only materialize in compile time savings, and we may just not have any useful IR test cases in the test suite that would see that.
I think the closest thing to a way to test this change would be to take some pattern that is *really* bad without this and include it in the test suite (much like we have other IR test cases in the test suite from other frontends). But I'm not sure it makes sense to block this on landing that. I generally think that is something we should encourage the Rust community to work on...
lib/Transforms/IPO/PassManagerBuilder.cpp:528
We shouldn't include PR links in the source like this. Whatever context is needed should be summarized (and I think it already is).
Sure.
I put a suggestion inline though, because of the interaction with -flto=thin.
lib/Transforms/IPO/PassManagerBuilder.cpp:532
What about moving this a little bit below and replacing:

```cpp
// If we are planning to perform ThinLTO later, let's not bloat the code with
// unrolling/vectorization/... now. We'll first run the inliner + CGSCC passes
// during ThinLTO and perform the rest of the optimizations afterward.
if (PrepareForThinLTO) {
  // Reduce the size of the IR as much as possible.
  MPM.add(createGlobalOptimizerPass());
  // Rename anon globals to be able to export them in the summary.
  MPM.add(createNameAnonGlobalPass());
  return;
}
```

with:

```cpp
if (RunInliner) {
  MPM.add(createGlobalOptimizerPass());
  MPM.add(createGlobalDCEPass());
}

// If we are planning to perform ThinLTO later, let's not bloat the code with
// unrolling/vectorization/... now. We'll first run the inliner + CGSCC passes
// during ThinLTO and perform the rest of the optimizations afterward.
if (PrepareForThinLTO) {
  // Rename anon globals to be able to export them in the summary.
  MPM.add(createNameAnonGlobalPass());
  return;
}
```

to avoid running the GlobalOptimizer twice?
Hi Davide,
Could you please test it on CTMark as well? (It's part of the LLVM test-suite and usually takes ~5 minutes to compile.) I wonder if we see improvements there.
Thanks,
Michael
lib/Transforms/IPO/PassManagerBuilder.cpp:532
+1
Sure, Michael.
Is there a single-line cmake invocation I can use to run the tests and grab the numbers?
I don't necessarily fancy the idea of setting up an LNT instance, but if I have to, I will.
I'd recommend using LNT; here is a script with a full ground-up setup for your convenience:

```
cd tmp

# Set up virtual environment
/usr/local/bin/virtualenv venv
. venv/bin/activate

# Check out test-suite and LNT
for D in test-suite lnt
do
  if [ ! -d "$D" ]; then
    git clone http://llvm.org/git/$D
  fi
done

# Install LNT
pip install -r lnt/requirements.client.txt
python lnt/setup.py develop

# Install LIT
pip install svn+http://llvm.org/svn/llvm-project/llvm/trunk/utils/lit/

# Set some variables
SANDBOX=/Path/To/Where/LNT/Will/Run
COMPILER=/Path/To/Your/Compiler
TESTSUITE=$PWD/test-suite # We should've checked it out here
OPTSET=ReleaseLTO # Or Os, or O0-g, or ReleaseThinLTO (see test-suite/cmake/caches for other options)

# Create sandbox
mkdir $SANDBOX
cd $SANDBOX

# Fill in LNT flags
LNT_FLAGS=" --sandbox $SANDBOX"
LNT_FLAGS+=" --no-timestamp"
LNT_FLAGS+=" --use-lit=lit"
LNT_FLAGS+=" --cc $COMPILER/bin/clang"
LNT_FLAGS+=" --cxx $COMPILER/bin/clang++"
LNT_FLAGS+=" --test-suite=$TESTSUITE"
LNT_FLAGS+=" --cmake-define TEST_SUITE_BENCHMARKING_ONLY=On"
LNT_FLAGS+=" --no-auto-name '${COMPILER%.*}'"
LNT_FLAGS+=" --output \"$SANDBOX/report.json\""
LNT_FLAGS+=" -C target-arm64-iphoneos -C $OPTSET"
LNT_FLAGS+=" --cmake-define TEST_SUITE_RUN_BENCHMARKS=Off"
LNT_FLAGS+=" --build-threads 1"
LNT_FLAGS+=" --cmake-define TEST_SUITE_SUBDIRS=\"CTMark\""

# Run LNT
lnt runtest test-suite ${LNT_FLAGS}
```
This will test the given compiler and generate a report in report.json in the sandbox folder with the results. There is a test-suite/utils/compare.py script for comparing such reports, though it has additional Python package requirements, so it might be a bit cumbersome to use.
Michael
This seems like a really cumbersome process to expect folks to go through for changes intending to improve compile time. Is there no way to simplify it?
For example, we can easily get test suite results just using llvm-lit directly with a build of the test-suite. Is there anything more comparable to this in terms of overhead for developers?
(Also, your instructions talk about target-arm64-iphoneos which I suspect needs to be adjusted for most folks?)
> This will test the given compiler and generate a report in report.json in the sandbox folder with the results. There is a test-suite/utils/compare.py script for comparing such reports, though it has additional Python package requirements, so it might be a bit cumbersome to use.
Heh, I don't find compare.py cumbersome; doing the full LNT setup seems much more so... Installing packages is relatively easy; it's the steps and deep layers that seem somewhat burdensome (IMO). I've really liked the recent direction of the test-suite toward running with llvm-lit and nothing else.
Which part is cumbersome? I used to do this all the time, and it is actually pretty nice once you get used to it. In particular, it can build once and run each test multiple times to get more samples and reduce the measurement noise. The comparison script is easy to use as well.
Note, in case there is any confusion: this does not require setting up or using an LNT *instance* (as in the web application).
On the other hand, I never found any similar convenience in the cmake/lit use of the test-suite (maybe it is too young to have proper support?).
I already have test-suite checked out and correctly hooked up to SPEC and other things. LNT is a large Python project including server code and other things. It all runs in a Python virtualenv which I don't know anything about and don't use for any other part of development.
> I used to do this all the time, and it is actually pretty nice once you get used to it. In particular, it can build once and run each test multiple times to get more samples and reduce the measurement noise.
Can definitely get multiple runs with lit directly, no need for virtualenv or LNT... In fact, AFAIK, LNT just uses lit to do all of this already. LNT seems really focused on running a server and collecting and aggregating results over time, rather than checking a single commit.
> The comparison script is easy to use as well.
Yeah, I have used the comparison script myself with lit directly, seems quite nice.
> Note, in case there is any confusion: this does not require setting up or using an LNT *instance* (as in the web application).
Yes, but you're still paying for a lot of this infrastructure.
> On the other hand, I never found any similar convenience in the cmake/lit use of the test-suite (maybe it is too young to have proper support?).
Possibly, dunno when this support landed.
I'll give it a whirl tomorrow.
FWIW, I share Chandler's concerns about the cost of running LNT vs. having something hooked into the build system.
I think it's really valuable (and I've used it in the past), but it's a little too much for everyday use [at least for my workflow].
I use it once every 2-3 months, then leave it aside and forget how to set it up (so I have to read the docs again, or ask around, as I did this time :)
That said:
IIRC, at some point Sean and I were able to just type make and get a JSON file spit out (I'll try that again tomorrow).
The only feature missing was a lit-like set of checks to make sure your changes didn't screw up compile time, e.g.:
```
ninja check-perf-baseline  -> this saves the baseline somewhere
git apply myuberpatch.diff
ninja clang
ninja check-perf           -> this now compares the new run with the baseline and
                              errors out in case size/compile time blows up [you can
                              probably imagine having the tolerance threshold as an input]
ninja wipe-perf-cache      -> now this wipes out the baseline
```
Maybe I'm the only one who thinks this is a better workflow for everyday development, but the rationale behind my claim is that this almost mirrors what people do before committing a change anyway, so running two extra commands shouldn't be a big deal.
> Which part is cumbersome?
I was talking about compare.py specifically. Namely, it uses the pandas package, which is not as commonly installed as others, and installing it wasn't very easy/fast, if I remember correctly. All I wanted to say is that these tools require some getting used to and might not work on the first try, and in the case of compare.py you might not want to use it if you only need it once.
> Can definitely get multiple runs with lit directly, no need for virtualenv or LNT... In fact, AFAIK, LNT just uses lit to do all of this already. LNT seems really focused on running a server and collecting and aggregating results over time, rather than checking a single commit.
This is true, one can run the same commands with cmake+make/ninja, and that'll give absolutely the same results. lnt runtest is just a wrapper for this.
> On the other hand, I never found any similar convenience in the cmake/lit use of the test-suite (maybe it is too young to have proper support?).
That actually surprised me. It should be pretty identical (at least now). Have you tried it recently?
> The only feature missing was a lit-like set of checks to make sure your changes didn't screw up compile time.
That's an interesting idea. I'm actually going to give a small refresher on how to collect compile-time results at the LLVM Dev Meeting; we can discuss what can be improved in that area there.
Michael
I uploaded the results from before and after. There are some wins in terms of size, which is good (this is at -Os).
The only concerning regression is 7zip, which gets around 1.8% slower (compile-time-wise) in exchange for some win (a reduction in text and eh_frame/eh_frame_hdr).
I ran that several times and saw this value oscillate between 0.8% and 1.8%, so there's some noise to take into account.
Overall, I think this change didn't significantly regress anything, so I'd vote for going forward (I'll also send a heads-up to llvm-dev, just in case).
Thoughts?
Thanks!
Hm, from the files you uploaded I don't see a slowdown on 7zip, but I do see some others. Are we talking about the same numbers? :)
```
Metric: compile_time
Program                             unpatched  patched  diff
lencod/lencod                           12.42    12.71   2.3%
kimwitu++/kc                            13.47    13.76   2.1%
Bullet/bullet                           28.41    28.99   2.0%
sqlite3/sqlite3                          7.75     7.88   1.7%
consumer-typeset/consumer-typeset       11.19    11.37   1.6%
mafft/pairlocalalign                     7.16     7.25   1.3%
SPASS/SPASS                             14.27    14.43   1.1%
ClamAV/clamscan                         12.96    13.06   0.8%
tramp3d-v4/tramp3d-v4                   18.24    18.19  -0.3%
7zip/7zip-benchmark                     34.97    34.96  -0.0%

Metric: size
Program                             unpatched  patched  diff
mafft/pairlocalalign                   355304   353552  -0.5%
tramp3d-v4/tramp3d-v4                  954304   950064  -0.4%
lencod/lencod                          646000   644472  -0.2%
Bullet/bullet                         1752656  1751688  -0.1%
7zip/7zip-benchmark                   1224296  1224336   0.0%
kimwitu++/kc                          1157536  1157504  -0.0%
ClamAV/clamscan                        658536   658536   0.0%
SPASS/SPASS                            566744   566744   0.0%
consumer-typeset/consumer-typeset      544184   544184   0.0%
sqlite3/sqlite3                        402200   402200   0.0%
```
BTW, is it possible to pull a test case from the Rust codebase for CTMark? I think it's good to add more real-world examples there.
Michael
Sorry, I meant sqlite3/sqlite3.
Out of curiosity, how do you get the diff? I eyeballed the differences (and missed something, of course :))
BTW, yes, we can probably work out adding some Rust test to CTMark; I'll ask Alex Crichton.
> Out of curiosity, how do you get the diff?
It's produced by the compare.py script:
```
python test-suite/utils/compare.py -m compile_time unpatched.json patched.json
python test-suite/utils/compare.py -m size unpatched.json patched.json
```
But to run it, you will first need to pip install pandas, which can take up to 10 minutes :)
As for the results: even though I'm not super excited about the (mostly) slowdowns, the gains on the real project probably outweigh them, so I think it's the right thing to do.
One follow-up thought: how hard would it be to run GlobalDCE only on functions? AFAIU, the gain is expected to come from removing functions that have been inlined and thus no longer need to be compiled. I wonder if we can avoid the slowdowns by looking only for dead functions when we call GlobalDCE after inlining (a rough sketch of the idea follows below).
Michael
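For illustration, a minimal sketch of what such a functions-only cleanup might look like (hypothetical code, not an existing LLVM pass; the real GlobalDCE also handles global variables, aliases, comdats, and mutually-dead function cycles, which this deliberately ignores):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Hypothetical helper: after inlining, erase internal functions with no
// remaining users, without scanning the rest of the module's globals the
// way full GlobalDCE does. Unlike GlobalDCE, this misses dead functions
// that still reference each other (e.g. dead mutual recursion).
static bool removeDeadLocalFunctions(Module &M) {
  bool Changed = false;
  for (auto It = M.begin(), End = M.end(); It != End;) {
    Function &F = *It++; // advance first: F may be erased below
    if (!F.isDeclaration() && F.hasLocalLinkage() && F.use_empty()) {
      F.eraseFromParent();
      Changed = true;
    }
  }
  return Changed;
}
```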
Here's the link to the original PR, since Michael asked about it:
https://github.com/rust-lang/llvm/pull/92
What levels of optimization are used in the tests?
I have only tested the Rust projects with -O3 and -O0, since those are the defaults for release/debug builds in Rust.
Maybe a check of the optimization level is needed.
I did a small test in Rust with -Os, and the compile time does not change with the patch, so a check for OptLevel > 2 might be good to have.
Note that "no change" is really quite expected in many cases... I don't think that makes this change inappropriate.
I'm somewhat confused by people's concerns here. This change seems *really* safe:
- The passes added are among the fastest passes in LLVM.
- They cover a clear, demonstrable weakness of the rest of the pipeline, deleting dead IR earlier and avoiding expensive work on that IR.
It's of course good to check this stuff, but honestly this isn't the change I would be worried about for compile time...
I would've checked this in a while ago :) Anyway, I think it was good to run some tests, but this is clearly a good thing to do.
I'm going ahead and committing this, as it seems we've built enough consensus around it. In the worst case, we can revert it :)
When this was merged into Rust, I noticed that it did not give any gain.
I think I did something wrong when I tested your modification of my original patch.
As I understand it, it is too late at this location, since most of the time is spent "in" addFunctionSimplificationPasses.
What was the problem with having the GlobalDCE pass directly after the inliner is added?
That breaks the entire pipelining model of the LLVM passes and inliner. =/
The only thing I know to help at that point would be to build a CGSCC version of GlobalDCE. This is something we've wanted for quite some time, but it is a fair amount of work. And it will also be substantially more interesting in the new pass manager model of the call graph.
@chandlerc since we saw such a drastic decrease in compile times after including the original version of this patch, could you elaborate on the downsides of what would happen if we were to do that in our local fork of LLVM? Or would it still be unacceptable to send this patch upstream?
The downside of the original version of the patch is most likely worse codesize and runtime performance. If the inliner isn't interleaved correctly, the heuristics become less accurate; the inliner depends on the simplification passes to get rid of dead and redundant code.
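To illustrate that interleaving with a small hypothetical C++ example (not from the review):

```cpp
// The CGSCC pipeline walks the call graph bottom-up: leaf() is inlined into
// mid() and mid() is then simplified, so by the time the inliner considers
// inlining mid() into top(), the cost model sees a small, cleaned-up body.
// Wedging a module pass like GlobalDCE between the inliner and the function
// simplification passes splits that pipeline, so decisions higher up the
// graph are made against stale, unsimplified callees.
static int leaf(int x) { return x > 0 ? x : -x; }
static int mid(int x)  { return leaf(x) + leaf(x + 1); }
int top(int x)         { return mid(x); }
```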
For the record, measuring CTMark can be as easy as this without any LNT involved:
```
git clone http://llvm.org/git/test-suite.git
mkdir test-suite-build
cd test-suite-build
cmake -DCMAKE_C_COMPILER=/path/to/my/clang -GNinja -DTEST_SUITE_SUBDIRS=CTMark -DTEST_SUITE_RUN_BENCHMARKS=Off ../test-suite
ninja
/path/to/my/llvm/builddir/bin/llvm-lit -o result.json .
```
Repeat for multiple compilers/patches (or with different -DCMAKE_C_FLAGS, etc.) and at the end use compare.py to compare the results of the runs.
Some more points on benchmarking (I really should write more docs rather than commenting here...):
- Use ninja -j1 to get less noisy compile-time measurements (there are of course a lot of other things you can and should do to reduce the noise level on your system; search for appropriate llvm-dev mailing list posts, etc.).
- On macOS you may need -C ../test-suite/cmake/caches/target-x86_64-macosx.cmake to pick up the right include directories from xcode/xcrun.
- As with any cmake project, I find ccmake . really helpful for discovering possible options and flags.
- To reduce noise, run the process multiple times and let compare.py take the minimum: compare.py base_run0.json base_run1.json base_run2.json vs patched_run0.json patched_run1.json patched_run2.json (note: the vs is an actual command-line argument).
So it is possible that the large improvement in compile time came from breaking the pipeline, and that because of this less optimization was possible/executed.
Unsure, but you can verify your theory by running the code with -time-passes passed to opt and comparing the times (please note that it doesn't really take analyses into account in some cases [or attributes them to other passes], but it should be good enough to start).
That's one possibility. There also is quite likely an issue Rust is hitting that the later positioning of these passes doesn't help with. It's hard or impossible to tell which is dominant here....
The only thing I'm confident in is what Eli said: this positioning *does* cripple the intended design of the core LLVM optimization pipeline and will severely hurt inlining and optimization quality on some code patterns. (Others, of course, will be largely unimpacted.)