This is an archive of the discontinued LLVM Phabricator instance.

Speedup dsymutil when working with big project.
Needs ReviewPublic

Authored by C-_-fan on Mar 4 2022, 9:07 AM.

Details

Summary

It too slow to generate dSYM file, when project is big.
Debug show that MainBinarySymbolAddresses.size() is 2,643k;

I replace linear lookup with a map to speed it up;
In my project it took 2.5 minutes than 30 minutes before.

Diff Detail

Event Timeline

C-_-fan created this revision.Mar 4 2022, 9:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 9:07 AM
C-_-fan requested review of this revision.Mar 4 2022, 9:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 9:07 AM

This only becomes a problem when you have a huge number of public symbols. I bet the majority of these symbols are weak exports from linkeonce_odr functions. Could you change your project to use something like -fvisibility=hidden to hide symbols by default? Doing that has the added benefit of saving the dynamic loader a bunch of work and improving launch time.

I ran some benchmarks on clang, and indeed, this patch actually slows things down:

Baseline:

Time (mean ± σ):     86.446 s ±  0.274 s    [User: 124.955 s, System: 4.702 s]
Range (min … max):   86.075 s … 86.932 s    10 runs

With this patch:

Time (mean ± σ):     88.012 s ±  0.502 s    [User: 127.594 s, System: 5.156 s]
Range (min … max):   87.397 s … 89.001 s    10 runs
llvm/lib/Support/StringRef.cpp
199–209 ↗(On Diff #413031)

I'm not sure I understand what this is trying to achieve. Changes to support should go in a separate patch with its own description/motivation and unit test coverage.

lebedev.ri added inline comments.
llvm/tools/dsymutil/MachODebugMapParser.cpp
58

std::unordered_map has bad performance characteristics.
Does it help if this is instead a SmallDenseMap of SmallVector's?

C-_-fan added inline comments.Mar 4 2022, 5:09 PM
llvm/lib/Support/StringRef.cpp
199–209 ↗(On Diff #413031)

Thank you for your review and comment.

In profile results, rfind method takes a large proportion of time, because substr(I, N) is slow;

If substr(I, N) equal to Str, character at (Data+ I) should be equal to the first character of Str.

llvm/tools/dsymutil/MachODebugMapParser.cpp
58

Thank you for your review and comment, I'll update and test later.

C-_-fan updated this revision to Diff 413209.EditedMar 5 2022, 5:06 AM

I am abandoned the change about rfind, and replaced std::unordered_map<uint64_t, std::vector<StringRef>> with SmallDenseMap<uint64_t, SmallVector<StringRef>>;

In my test, now it take more time because rfind performance rollback; but SmallDenseMap and SmallVector is fast than unordered_map and vector.

I will try -fvisibility=hidden next week.

Please reivew it agagin. If can not merge still, how can I close this revision? Or this will close auto after timeout?

Thanks again.

C-_-fan updated this revision to Diff 413359.Mar 7 2022, 12:26 AM

Whether or not can I find a threshold to divide the code into 2 branches? One branch is use to processing the normal project and the other one is use to processing the overlong project.

C-_-fan updated this revision to Diff 413374.Mar 7 2022, 1:39 AM
C-_-fan updated this revision to Diff 413701.Mar 7 2022, 9:40 PM
C-_-fan removed subscribers: lebedev.ri, hiraditya, dexonsmith.

I just came in to comment on the API change — seems like ArrayRef would be better! — but a spotted a couple of other things.

(I don't have much context on the logic here in dsymutil, so these are just suggestions; probably @JDevlieghere or @lebedev.ri will need to LGTM; but thought I'd share what I noticed.)

llvm/tools/dsymutil/MachODebugMapParser.cpp
58

You could refine this a bit I think:

Firstly, storing SmallVector<StringRef> on the heap is a bit questionable, since SmallVector includes non-trivial small storage. Probably you want SmallVector<StringRef, 1> to customize the small storage for the common case (assuming 1 symbol per address is the common case?).

Secondly, StringRef is the size of two pointers, but you could use a StringMapEntry<uint64_t> * into MainBinarySymbolAddresses to halve the storage. You'd want to update:

if (Extern)
  MainBinarySymbolAddresses[Name] = Addr;
else
  MainBinarySymbolAddresses.try_emplace(Name, Addr);

to:

auto I = MainBinarySymbolAddresses.try_emplace(Name, Addr);
// Replace the na
if (Extern && !I.second)
  I.first->second = Addr;
if (Extern || I.second)
  MainBinaryAddressses2Names[Addr].push_back(&*I.first);

Writing that suggested code, I have reproduced what I think is a logic bug, where the API now returns something different than before your patch. See my other inline comment.

Thirdly, I think you could probably do something complicated with a flat SmallVector and augment the data in MainBinarySymbolAddresses with two uint32_ts to indicate the range within it. More complicated, but probably less memory overall, faster for accessing, and maybe faster for building (depending on various characteristics). Here is the basic model:

  • Change MainBinarySymbolAddresses to store a struct that has an uint32_t index; maybe called SymbolAddressData; index starts out as -1U.
  • During the loop, build up SmallVector<StringMapEntry<SymbolAddressData> *> as you go, only pushing back on insertions.
  • Use a stable sort to make the vector ordered by pointed-address, so aliases are adjacent to each other.
  • Do a pass through the vector. For each address:
    • If there's only one symbol with that address, filter it out, and leave SymbolAddressData pointing at -1U.
    • Else, update each SymbolAddressData to point at the first index into the vector with that address.
  • Update getMainBinarySymbolNames():
    • If the index is -1U, return itself.
    • Else, iterate through the vector starting from the saved index.

Probably not worth doing, but if there are concerns about memory overhead, this would reduce the overhead significantly.

(Maybe a simpler alternative would be to skip caching the index in the StringMap. Instead, do a std::lower_bound() to find the first symbol with the same address. If that first symbol is end() or has the wrong address, then you have the -1U case. This might be faster overall, depending on the workload, especially if the common case is "one symbol per address", since the vector will end up being pretty small after filtering and the binary search will be trivially fast.)

83

An API like this should probably return an ArrayRef<StringRef> (immutable; doesn't leak the internal storage type) rather than SmallVector<StringRef>. Although my other inline comments point in a different direction...

594–595

It's not obvious to me that MachODebugMapParser::getMainBinarySymbolNames() is going to return the same things as before. If this replaces a different Addr, then there will be two different Addrs that have this symbol.

I'm not sure if the precise, existing behaviour is important, but if it is, I think you'll need a second pass through. I think something like this would work (outside/after the for loop that's building up these data structures):

for (auto &AliasList : MainBinarySymbolAddress2Name) {
  AliasList.second.erase(
      llvm::remove_if(AliasList.second,
          [&](StringMapEntry<uint64_t> *Entry) {
            return Entry->second != AliasList.first;
          }),
      AliasList.second.end());
}