This is an archive of the discontinued LLVM Phabricator instance.

Lazily sort ScopesWithImportedEntities to speed up LTO linking.
AbandonedPublic

Authored by krasin on Oct 14 2015, 2:02 PM.

Details

Summary

This CL speeds up linking the official Chrome build with LTO by 1.8x.

See more details in https://crbug.com/542426

Diff Detail

Event Timeline

krasin updated this revision to Diff 37388.Oct 14 2015, 2:02 PM
krasin retitled this revision from to Lazily sort ScopesWithImportedEntities to speed up LTO linking..
krasin updated this object.
krasin added reviewers: thakis, pcc.
krasin added subscribers: llvm-commits, kcc.

Very cool! I'm not familiar with this code, so I shouldn't review this. Looks like dblaikie touched this not too long ago; he can probably approve this :-)

pcc added inline comments.Oct 15 2015, 5:40 AM
lib/CodeGen/AsmPrinter/DwarfDebug.h
313–314

You might consider llvm::DenseMap<const MDNode *, llvm::SmallVector<const MDNode *, 4>> as a better data structure for this. That way, you won't need to maintain an ordering.

Please take another look.

lib/CodeGen/AsmPrinter/DwarfDebug.h
313–314

I tried that. The build time is 5 minutes longer than with the code I currently have.
Your proposal might be better for the case when all debug info is actually needed. But in this case (given the importance of having this part as fast as possible), the right approach would be to store these entries in a vector, and then create a DenseMap only if needed. That would also be faster overall, since the number of buckets for the DenseMap would be known up front.

I propose to go with the current patch, as it provides the largest speed gain.

dblaikie edited edge metadata.Oct 16 2015, 11:46 AM

When is not all the debug info needed? & how did it end up in the IR if it wasn't needed?

krasin added a comment.EditedOct 16 2015, 1:28 PM

When is not all the debug info needed? & how did it end up in the IR if it wasn't needed?

See lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:342:
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp&q=lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:342&sq=package:chromium&type=cs&l=342

// Skip imported directives in gmlt-like data.
if (!includeMinimalInlineScopes()) {
  // There is no need to emit empty lexical block DIE.
  for (const auto &E : DD->findImportedEntitiesForScope(DS))
    Children.push_back(
        constructImportedEntityDIE(cast<DIImportedEntity>(E.second)));
}

When the official Chrome is compiled, includeMinimalInlineScopes() returns true, and DwarfDebug::findImportedEntitiesForScope is never called.
includeMinimalInlineScopes() is defined in lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:823:
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp&q=includeMinimalInlineScopes%20file:%5Esrc/third_party/llvm/&sq=package:chromium&type=cs&l=823

bool DwarfCompileUnit::includeMinimalInlineScopes() const {
  return getCUNode()->getEmissionKind() == DIBuilder::LineTablesOnly ||
         (DD->useSplitDwarf() && !Skeleton);
}

At some point, the official Chrome was compiled with LineTablesOnly; then it was realized that stack traces in crash reports were incomplete, so I conclude that Chrome uses split DWARF. See https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/8tKb9-Wr8gk

Update: no, Chrome does not use split DWARF for the official builds. I was wrong:

https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&q=gsplit-dwarf&sq=package:chromium&type=cs&l=3804

At this time, I have no good explanation for why findImportedEntitiesForScope does not get called. I don't think this blocks the review, since the CL does not change the logic; it only makes the sorting operation lazy.

I will continue to read the code to find an answer to this question, but please take a look at the CL. It's tested: the Chrome binary links faster and is byte-for-byte identical to one built without this change.

krasin added a comment.EditedOct 16 2015, 2:04 PM

I have got the first results from my debug run:

bool DwarfCompileUnit::includeMinimalInlineScopes() const {
  fprintf(stderr, "DwarfCompileUnit::includeMinimalInlineScopes, emission kind: %d\n", (int)getCUNode()->getEmissionKind());
  fprintf(stderr, "useSplitDwarf: %d\n", (int)DD->useSplitDwarf());
  fprintf(stderr, "Skeleton: %p\n", (void*)Skeleton);
  _exit(1);
...
DwarfCompileUnit::includeMinimalInlineScopes, emission kind: 1
useSplitDwarf: 0
Skeleton: (nil)

where emission kind: 1 means FullDebug: https://code.google.com/p/chromium/codesearch#chromium/src/third_party/llvm/include/llvm/IR/DIBuilder.h&q=DIBuilder&sq=package:chromium&type=cs&l=71

Now I have another hypothesis, which also explains the speedup: instead of sorting after each insertion into the vector, I sort only once.
If that's true, the CL could be slightly simplified. I have started a new Chrome linking process which should give us this info in about 15 minutes.

krasin updated this revision to Diff 37650.Oct 16 2015, 3:54 PM
krasin edited edge metadata.

sync

Now I have a better explanation of what happens here.

  1. In trunk, stable_sort is called many times, doing a lot of unnecessary work after each small update of the vector. The memory gets shuffled, the cache gets polluted, and time is wasted.
  2. This CL sorts only when it's really needed. While sorting pulls the contents of the vector into the cache (thus evicting other useful info), these contents are immediately used by the caller of findImportedEntitiesForScope, so we actually warm up the cache!
  3. I tried to place the sort invocation right after the loop in DwarfDebug::constructAndAddImportedEntityDIE, but then I realized that it's not correct, as findImportedEntitiesForScope could be called from within the loop. Also, that's actually slower than the current CL by 2 minutes.

Please take another look.

krasin updated this object.Oct 16 2015, 4:00 PM

David, friendly ping. :)

It'd be nice if we could sort immediately after populating - but the issue
you've identified is that we use it within the body of the for(CUs) loop
after we've added them for one CU?

What about if we sunk the data structure down into the DwarfUnit (actually
we should be able to sink it into DwarfCompileUnit - these specific scopes
should never appear in DwarfTypeUnits)? Then we would reduce the size of
the data structure to search through (since each query should know which
unit it's for - effectively bucketizing it) & be able to sort immediately
after populating, since it'll be "populate the map for this CU, sort it,
then never touch it again"?

It'd be nice if we could sort immediately after populating - but the issue
you've identified is that we use it within the body of the for(CUs) loop
after we've added them for one CU?

This is correct.

What about if we sunk the data structure down into the DwarfUnit (actually
we should be able to sink it into DwarfCompileUnit - these specific scopes
should never appear in DwarfTypeUnits)? Then we would reduce the size of
the data structure to search through (since each query should know which
unit it's for - effectively bucketizing it) & be able to sort immediately
after populating, since it'll be "populate the map for this CU, sort it,
then never touch it again"?

I tried that; see http://reviews.llvm.org/D13918. I didn't optimize it for speed; it only loses a minute compared to the current CL anyway.
The problem is that the resulting binary is different. My understanding is that it could not find some of the imported entities for a given scope, because they could end up in a different DwarfCompileUnit.

Have I understood your suggestion right?

krasin abandoned this revision.Oct 22 2015, 11:51 AM

I am abandoning this revision in favor of http://reviews.llvm.org/D13918.
I have moved the collection into DwarfCompileUnit, as it's the only user of the data.
I also use a DenseMap, as proposed by Peter.