This is an archive of the discontinued LLVM Phabricator instance.

Create synthetic symbol names on demand to improve memory consumption and startup times.
ClosedPublic

Authored by clayborg on Jun 17 2021, 2:31 PM.

Details

Summary

This fix was created after profiling the target creation of a large C/C++/ObjC application that contained almost 4,000,000 redacted symbol names. The symbol table parsing code was creating names for each of these synthetic symbols and adding them to the name indexes. The code was also adding the object file basename to the end of the symbol name which doesn't allow symbols from different shared libraries to share the names in the constant string pool.

Prior to this fix this was creating 180MB of "___lldb_unnamed_symbol" symbol names and was taking a long time to generate each name, add them to the string pool and then add each of these names to the name index.

This patch fixes the issue by:

  • not adding a name to synthetic symbols at creation time, and allows name to be dynamically generated when accessed
  • doesn't add synthetic symbol names to the name indexes, but catches this special case as name lookup time. Users won't typically set breakpoints or lookup these synthetic names, but support was added to do the lookup in case it does happen
  • removes the object file baseanme from the generated names to allow the names to be shared in the constant string pool

Prior to this fix the startup times for a large application was:
12.5 seconds (cold file caches)
8.5 seconds (warm file caches)

After this fix:
9.7 seconds (cold file caches)
5.7 seconds (warm file caches)

The names of the symbols are auto generated by appending the symbol's UserID to the end of the "___lldb_unnamed_symbol" string and is only done when the name is requested from a synthetic symbol if it has no name.

Diff Detail

Event Timeline

clayborg created this revision.Jun 17 2021, 2:31 PM
clayborg requested review of this revision.Jun 17 2021, 2:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2021, 2:31 PM

lgtm

I wonder if it's lldb's style to rename m_mangled to m_mangled_do_not_use (or something like that) to prevent people from using it in the future.

lldb/source/Symbol/Symtab.cpp
625

these queries

clayborg updated this revision to Diff 353120.Jun 18 2021, 4:04 PM

Fixed a case where the accessor was being used twice in the same function and fixed a typo in a comment.

wallace accepted this revision.Jun 18 2021, 5:25 PM
This revision is now accepted and ready to land.Jun 18 2021, 5:25 PM
jingham added a comment.EditedJun 21 2021, 10:12 AM

Are the Symbol ID's for unnamed symbols the same each time you read in a symbol file? While the unnamed_symbol symbol names are not significant, it would be good if you were crashing in __lldb_unnamed_symbol111 on one lldb run, you would also crash in the same unnamed symbol when you crashed again.

shafik added a subscriber: shafik.Jun 21 2021, 4:06 PM
shafik added inline comments.
lldb/include/lldb/Symbol/Symtab.h
222

We should add Doxygen comment for this member function. I know we are not consistent with doing this but for new stuff we should do this and fix when we can we refactoring. Thank you!

I thought about this because I noticed we are returning 0 and we had an explicit comment about what it meant and this is where it really belongs.

I also noticed we use UINT32_MAX but we don't seem to have an alias for that either.

lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
2820–2825

Since we are touching this can we move to using parameter name in the comment style as documented here: https://llvm.org/docs/CodingStandards.html#comment-formatting

We even have a clang-tidy check to verify this: https://clang.llvm.org/extra/clang-tidy/checks/bugprone-argument-comment.html

e.g./*name=*/llvm::StringRef(), /*type=*/eSymbolTypeCode, ...

lldb/source/Symbol/Symtab.cpp
635

/*Radix=*/10

Are the Symbol ID's for unnamed symbols the same each time you read in a symbol file? While the unnamed_symbol symbol names are not significant, it would be good if you were crashing in __lldb_unnamed_symbol111 on one lldb run, you would also crash in the same unnamed symbol when you crashed again.

Yes, symbol IDs are consistent as they encode the UserID of the symbol as the number which will be the same on each run as long as the binary doesn't change. The UserID for synthetic symbols always start with the last valid actual symbol index from the main symbol table. So the numbers are just as good as they are before, they just don't start at 1 anymore, the start at the size of the actual symbol table.

Are the Symbol ID's for unnamed symbols the same each time you read in a symbol file? While the unnamed_symbol symbol names are not significant, it would be good if you were crashing in __lldb_unnamed_symbol111 on one lldb run, you would also crash in the same unnamed symbol when you crashed again.

Yes, symbol IDs are consistent as they encode the UserID of the symbol as the number which will be the same on each run as long as the binary doesn't change. The UserID for synthetic symbols always start with the last valid actual symbol index from the main symbol table. So the numbers are just as good as they are before, they just don't start at 1 anymore, the start at the size of the actual symbol table.

Excellent, the specific number is unimportant so long as they are consistent. This seems a side effect of they way they are computed, might be good to drop a judicious comment somewhere saying why it's important...

clayborg updated this revision to Diff 353550.Jun 21 2021, 10:28 PM

Fix comments from review:

  • Add in-line C-style comment for parameter to symbol creation.
  • Fixed the entry point address to encode the offset correctly for the ELF e_entry address
  • Add header doc for the new Symtab function
clayborg marked 3 inline comments as done.Jun 21 2021, 10:29 PM
clayborg added inline comments.
lldb/source/Symbol/Symbol.cpp
571–580

This is the comment Jim Ingham asked for.

Some comments on comments...

lldb/include/lldb/Symbol/Symtab.h
224

This comment is hard to read. I think it's mostly because you describe the implementation before the reason for it.

Maybe this would be clearer like:

"We generate unique names for synthetic symbols so that users can look them up by name when needed. But because doing so is uncommon in normal debugger use, we trade off some performance at lookup time for faster symbol table building by detecting these symbols and generating their names lazily, rather than adding them to the normal symbol indexes. This function does the job of first consulting the indexes, and if that fails checking whether the symbol has the synthetic symbol prefix and generating the correct synthetic name if it does.

lldb/source/Symbol/Symbol.cpp
573

are -> is

or maybe:

starts with the synthetic symbol prefix, followed by a unique number

574

is -> of so this reads:

Typically the UserID of a real symbol is ...

576

I don't think you need the implementation detail here, you are stating policy. Starting from "Typically" I think something like the following is more direct:

Typically the UserID of a real symbol is the symbol table index of the symbol in the object file's symbol table(s), so it will be the same every time you read in the object file. We want the same persistence for synthetic symbols so that users can identify them across multiple debug sessions, to understand crashes in those symbols and to reliably set breakpoints on them.

lldb/source/Symbol/Symtab.cpp
624

so -> to

clayborg updated this revision to Diff 353715.Jun 22 2021, 10:59 AM

Fixed comments per Jim Ingham's comments.

Anyone have any other comments? Or is this good to go?

Anyone have any objections now?

I'm fine with the change.

Sounds good, can someone accept the patch?

wallace accepted this revision.Jun 23 2021, 5:14 PM

good to go

aprantl added inline comments.Jun 25 2021, 12:18 PM
lldb/source/Symbol/Symtab.cpp
638

if (!symbol) ?

This revision was landed with ongoing or failed builds.Jun 28 2021, 6:05 PM
This revision was automatically updated to reflect the committed changes.

This broke the windows lldb bot and the follow up changes did not fix it:

https://lab.llvm.org/buildbot/#/builders/83/builds/7748

Can you have a look?