Page MenuHomePhabricator

Fix PR26055 - LiveDebugValues is very slow
ClosedPublic

Authored by aprantl on May 11 2016, 11:43 AM.

Details

Summary

This path modifies the LiveDebugValues pass to use more efficient set datastructures as outlined in https://llvm.org/bugs/show_bug.cgi?id=26055.

This original revision is Daniel's original patch.
I'll follow-up with some bug fixes on top of it soon.

Diff Detail

Event Timeline

aprantl updated this revision to Diff 44767.May 11 2016, 11:43 AM
aprantl retitled this revision from to Fix PR26055 - LiveDebugValues is very slow.
aprantl updated this object.
aprantl added a subscriber: llvm-commits.
aprantl updated this revision to Diff 56947.May 11 2016, 11:58 AM

This fixes the remaining bugs in the original patch and adds some cleanups.

  • [Correctness] The VarLoc::operator<() implementation now takes semantic equality between two VarLocs (that have different MI pointers in them) into account.
  • [Correctness] Both operator<() implementations now check for equality of the first pointer before comparing the second pointer.
  • In join(), I'm using an in-place llvm::set_intersect().
  • [Debuggability] printVarLocInMBB() now sorts the output in BB order
  • The order of the DEBUG_VALUEs is different when they come from a set, so I needed to update two of our testcases.

Before this patch, a particularly nasty ASANified testcase that I had took

user	9m20.248s

After this patch, the same testcase compiles in

user	3m12.294s

Definitely a big win! I'll look into the Bitvectors next, but I think we can also land this version of the patch meanwhile.

Thanks for taking on this work!

I'm curious:
Does your new patch + correctness fixes take > 2 iterations to converge on
the .ll file?
(If so, something is still wrong :P)

Good Point, and no, it doesn't converge!
In retrospect it is pretty obvious that VarLoc::getHashValue() still cannot be right. I'll fix that.

aprantl updated this revision to Diff 56968.May 11 2016, 2:51 PM

Fixed VarLoc::getHashValue().

On trunk the testcase from the PR converges in 3 iterations and ~5s;
with this patch it converges in 3 iterations and ~4.8s.

weimingz added inline comments.May 11 2016, 3:09 PM
lib/CodeGen/LiveDebugValues.cpp
422

Can we bail out if no coverage after certain iterations?

dberlin edited edge metadata.May 11 2016, 4:40 PM
dberlin added a subscriber: dberlin.

The number of iterations should be loop depth + 3, maximum

aprantl edited edge metadata.May 13 2016, 10:09 AM
aprantl added a subscriber: wolfgangp.

Ok, sorry, that was a bit silly on my end: The example from the PR has two functions, and I "measured" by adding a print after every Worklist.swap(Pending);
After adding a second print upon entering a new function the full printout now reads:

entering function_Z3fooRKN2cv3MatEPhRKNS_8KeyPointE
iteration completed
iteration completed
entering functionasan.module_ctor
iteration completed

So, yes, it does indeed converge after two iterations.

Awesome. That's a pretty good sign it's correct :)

aprantl updated this revision to Diff 57297.May 15 2016, 9:24 AM

This version replaces the data structures with SparseBitVectors and adds a couple of further optimizations.
Apart from making the algorithm much easier to read, this now brings the ASAN example I've been primarily testing on down to just 8 seconds(!) in a NoAsserts build (while producing an identical object file :-).

Some of the optimizations I added on top are:

  • VarLoc now explicitly stores the register location for faster comparisons.
  • transferDebugValue no longer checks the inlinedAt field. This makes a huge difference because looking up the inline scope of a MachineInstr is quite slow. This means that we can't propagate more than one inlined copies of the same variable. IMHO this is an acceptable trade-off. Note that this only affects accuracy, not correctness.
  • I experimented with many different data structures (std::list, SmallDenseMap, SmallBitVector) for VarLocList, but settled on a SparseBitVector, because it makes the InLocs |= OpenRanges operation very natural. Iterating over the set bits in a SparseBitVector can be expensive, but it doesn't seem to make much of a difference in practice.

I think this is as far as I want to go with optimizations for PR26055.
Once we add support for more differnet kinds of debug (value) locations, it might make sense to revisit this.

dberlin added inline comments.May 15 2016, 10:40 AM
lib/CodeGen/LiveDebugValues.cpp
252

Errr, this seems like a regression :P
Also, if you want to float this idea, IMHO, you should split up this patch:

First, make it do the same thing with sparsebitvectors
Second, make any functionality/design changes you want on top of that.

276

Are you positive of the invalidation semantics of sparsebitvector iterators?

In particular, what happens if the reset call causes the current element to go away?
I do believe that will cause the iterator to be invalidated.

(Also, this is N^2)

286

Ditto iterator invalidation here.

You have have to make a temporary sparsebitmap and subtract it (intersectwithcomplement)

davide added a subscriber: davide.May 15 2016, 1:03 PM
davide added inline comments.
lib/CodeGen/LiveDebugValues.cpp
52

Can you please add a message to these assertions? Thanks!

davide added a comment.EditedMay 15 2016, 1:40 PM

First, let me tell you that I'm very excited about this patch. During an LTO build of medium-to-large sized C++ applications, I can clearly see this pass showing up in the profile, so I decided to benchmark it. Unfortunately, I think the patch as is introduces a regression :(

#0 0x00000000005e4398 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/davide/work/build-llvm/bin/lld+0x5e4398) 0x00000000005e1eee llvm::sys::RunSignalHandlers() (/home/davide/work/build-llvm/bin/lld+0x5e1eee)
#2 0x00000000005e2136 SignalHandler(int) (/home/davide/work/build-llvm/bin/lld+0x5e2136)
#3 0x00007f71573309f0 __restore_rt (/lib64/libpthread.so.0+0x109f0)
#4 0x000000000141867c (anonymous namespace)::LiveDebugValues::transferRegisterDef(llvm::MachineInstr&, llvm::SparseBitVector<128u>&, llvm::UniqueVector<(anonymous namespace)::LiveDebugValues::VarLoc> const&) [clone .isra.178] [clone .constprop.275] (/home/davide/work/build-llvm/bin/lld+0x141867c)
#5 0x000000000141a827 (anonymous namespace)::LiveDebugValues::ExtendRanges(llvm::MachineFunction&) [clone .constprop.265] (/home/davide/work/build-llvm/bin/lld+0x141a827)
#6 0x0000000001479b65 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/home/davide/work/build-llvm/bin/lld+0x1479b65)
#7 0x0000000001c3a0b2 llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/davide/work/build-llvm/bin/lld+0x1c3a0b2)
#8 0x0000000001c3a41b llvm::FPPassManager::runOnModule(llvm::Module&) (/home/davide/work/build-llvm/bin/lld+0x1c3a41b)
#9 0x0000000001c39bfe llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/davide/work/build-llvm/bin/lld+0x1c39bfe)
#10 0x00000000014dae09 codegen(llvm::Module*, llvm::raw_pwrite_stream&, std::function<std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine> > ()>, llvm::TargetMachine::CodeGenFileType) [clone .constprop.50] (/home/davide/work/build-llvm/bin/lld+0x14dae09)
#11 0x00000000014db08a llvm::splitCodeGen(std::unique_ptr<llvm::Module, std::default_delete<llvm::Module> >, llvm::ArrayRef<llvm::raw_pwrite_stream*>, llvm::ArrayRef<llvm::raw_pwrite_stream*>, std::function<std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine> > ()> const&, llvm::TargetMachine::CodeGenFileType, bool) (/home/davide/work/build-llvm/bin/lld+0x14db08a)
#12 0x00000000005bb240 lld::elf::BitcodeCompiler::runSplitCodegen(std::function<std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine> > ()> const&) (/home/davide/work/build-llvm/bin/lld+0x5bb240)
#13 0x00000000005bc827 lld::elf::BitcodeCompiler::compile() (/home/davide/work/build-llvm/bin/lld+0x5bc827)
#14 0x000000000057ab45 lld::elf::SymbolTable<llvm::object::ELFType<(llvm::support::endianness)1, true> >::addCombinedLtoObject() (/home/davide/work/build-llvm/bin/lld+0x57ab45)
#15 0x00000000004ecfb8 void lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::support::endianness)1, true> >(llvm::opt::InputArgList&) (/home/davide/work/build-llvm/bin/lld+0x4ecfb8)
#16 0x000000000044e7ec lld::elf::LinkerDriver::main(llvm::ArrayRef<char const*>) (/home/davide/work/build-llvm/bin/lld+0x44e7ec)
#17 0x00000000004eefbe lld::elf::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, bool) (/home/davide/work/build-llvm/bin/lld+0x4eefbe)
#18 0x000000000044dce7 main (/home/davide/work/build-llvm/bin/lld+0x44dce7)
#19 0x00007f71564ce580 __libc_start_main (/lib64/libc.so.6+0x20580)
#20 0x00000000004a03b9 _start (/home/davide/work/build-llvm/bin/lld+0x4a03b9)
Stack dump:
0.      Program arguments:
1.      Running pass 'Function Pass Manager' on module 'ld-temp.o'.
2.      Running pass 'Live DEBUG_VALUE analysis' on function '@_Z10iPPc'

I have limited time next week, but I'll try anyway to reduce a test case for you ASAP.

My guess is you have a case where the iterator invalidation causes it to
crash :)

dberlin added inline comments.May 16 2016, 9:40 AM
lib/CodeGen/LiveDebugValues.cpp
276

One way to solve the invalidation problem, other than creating a separate bitvector, would be to not use the auto loop, but instead to always move the iterator forward and erase behind the iterator.

Two, you can solve the O(N^2)-ness a few ways (it's really worse, but let's just deal with the part going over the open ranges):

A simple way -

Form a bitvector out of this:

for (const MachineOperand &MO : MI.operands()) {
   if (MO.isReg() && MO.isDef() && MO.getReg() &&
       TRI->isPhysicalRegister(MO.getReg())) {
     // Remove ranges of all aliased registers.
     for (MCRegAliasIterator RAI(MO.getReg(), TRI, true); RAI.isValid(); ++RAI)
         <set bit for *RAI here>

Then take this loop:

for (unsigned ID : OpenRanges)
  if (VarLocIDs[ID].isDescribedByReg() == *RAI)

and do the tests against the bitvector instead (ie test the bit VarLocIDs[ID].isDescribedByReg())
This will only process each range once instead of process each range for every operand :)

This is essentially splitting it into:

  1. Figure out which registers the machine instruction affects
  2. Test which open ranges they affect

You can do things that are more complicated if this part gets slow.

286

You could also just move the iterator ahead of doing the reset.

aprantl updated this revision to Diff 57370.May 16 2016, 10:55 AM

To establish a useful basis for discussion and further optimization, here is an NFC patch (no dropping of inlined copies) that also fixes the iterator invalidation problem noted by Daniel and reported by Davide.

I'm now going to look into Daniel's suggestions about improving the O(n^2) inner loops. Thanks!

Note: If the time bound on the version with N^2 inner loops is beyond what
you want to do, happy to review the patch as-is and get it in, and we can
just drop a comment about further improvement

Note: If the time bound on the version with N^2 inner loops is beyond what
you want to do, happy to review the patch as-is and get it in, and we can
just drop a comment about further improvement

Thanks, I think it would be a good idea to make a cut here and land this revision in trunk as an intermediate step so we can use it as a basis for further improvements!

I'll keep working on it, but doing so takes time, and I keep getting distracted by other things.

As for my progress: I'm done with my first iteration of rewriting transferRegisterDef() and it had almost no effect on my benchmark. This is because in my benchmark none of the variables survive the first basic block, which explains why I couldn't measure any improvements. I'll need to find a better benchmark to profile my bitvectorized transferRegisterDef() implementation.

So where are the remaining 2 minutes spent? With some more profiling I figured that most of the time is spent in transferDebugValue(). In my benchmark OpenRanges has >5000 elements (mostly from inlined C++ one-liners) so linearly iterating over all of them for every DEBUG_VALUE is very expensive. My idea is to exploit the sorted-ness of the map in the UniqueVector to only iterate over only over the elements that share the same DebugVariable.

I'll post a separate review once I've got something presentable.

mgrang added a subscriber: mgrang.May 25 2016, 1:03 PM
mgrang added inline comments.
lib/CodeGen/LiveDebugValues.cpp
24

Includes should be alphabetically ordered:

#include "llvm/ADT/SparseBitVector.h" should come before #include "llvm/ADT/Statistic.h"

LGTM

lib/CodeGen/LiveDebugValues.cpp
55

Nit: indirectly

103

Errr, do you really never use the hash and the registerloc at the same time?

aprantl added inline comments.May 25 2016, 1:15 PM
lib/CodeGen/LiveDebugValues.cpp
103

The hash is meant to be a convenient alternative view of the the two uint32_t as a single uint64_t. Now that I think of it, I wonder whether that is guaranteed on all imaginable architectures.

Yeah, i honestly don't know.
It's a need way to avoid constructing a real hash, my next question would
be "is it actually demonstrably slower to compute a hash for real"?

Also, since the stuff is constant anyway, can't you just compute it at init
time anyway?
It's not like you have to compute it repeatedly.

Thanks! I landed an updated version that includes all open suggestions in r270776.

aprantl accepted this revision.May 26 2016, 11:01 AM
aprantl added a reviewer: aprantl.

This can be closed now.

This revision is now accepted and ready to land.May 26 2016, 11:01 AM
aprantl closed this revision.May 26 2016, 11:01 AM