Page MenuHomePhabricator

Parallelize demangling
Needs ReviewPublic

Authored by scott.smith on May 3 2017, 11:54 AM.

Details

Reviewers
clayborg
jingham
Summary

Use TaskMapOverInt to demangle the symbols in parallel. Defer categorization of C++ symbols until later, when it can be determined what contexts are definitely classes, and what might not be.

Diff Detail

Repository
rL LLVM

Event Timeline

scott.smith created this revision.May 3 2017, 11:54 AM

This commit uses TaskMapOverInt from change D32757, which hasn't landed yet.

scott.smith updated this revision to Diff 97908.May 4 2017, 8:44 PM
  1. git clang-format
  2. new TaskMapOverInt api (begin, end, fn instead of end, batch, fn)
labath added a subscriber: labath.May 5 2017, 4:08 AM
labath added inline comments.
source/Symbol/Symtab.cpp
257

The function looks big enough to deserve a name. (== Please move the lambda out of line)

scott.smith added inline comments.May 5 2017, 7:20 AM
source/Symbol/Symtab.cpp
257

ok but now's your chance to review it as a diff rather than a sea of red and green....

scott.smith updated this revision to Diff 97970.May 5 2017, 9:22 AM
scott.smith marked 2 inline comments as done.

sea of red and green
Be explicit about what the lambda has access to (and make the inputs const) to prevent concurrency bugs.

clayborg edited edge metadata.May 5 2017, 9:52 AM

What are the measured improvements here? We can't slow things down on any platforms. I know MacOS didn't respond well to making demangling run in parallel. I want to see some numbers here. And please don't quote numbers with tcmalloc or any special allocator unless you have a patch in LLDB already to make this a build option.

source/Symbol/Symtab.cpp
233–239

Not being able to search for symbols by name when they are trampolines? If you lookup symbols by name I would expect things not to fail and I would expect that I would get all the answers, not just ones that are omitted for performance reasons. I would also not expect to have to specify extra flags. Please remove

scott.smith marked an inline comment as done.May 5 2017, 10:50 AM

What are the measured improvements here? We can't slow things down on any platforms. I know MacOS didn't respond well to making demangling run in parallel. I want to see some numbers here. And please don't quote numbers with tcmalloc or any special allocator unless you have a patch in LLDB already to make this a build option.

Without tcmalloc, on Ubuntu 14.04, 40 core VM: 13%
With tcmalloc, on Ubuntu 14.04, 40 core VM: 24% (built using cmake ... -DCMAKE_EXE_LINKER_FLAGS=-ltcmalloc_minimal, which amazingly only works when building with clang, not gcc...)

I don't have access to a Mac, and of course YMMV depending on the application.

so yeah, it's a bigger improvement with tcmalloc. Interestingly, I tried adding back the demangler improvements I had queued up (which reduced memory allocations), and they didn't matter much, which makes me think this is contention allocating const strings. I could be wrong though.

By far the biggest performance improvement I have queued up is loading the shared libraries in parallel (D32597), but that's waiting on pulling parallel_for_each from LLD into LLVM (I think).

If you're really leery of this change, I could just make the structural changes to allow parallelization, and then keep a small patch internally to enable it. Or enabling it could be platform dependent. Or ... ?

source/Symbol/Symtab.cpp
233–239

This is just moved code, not new code. You can use the phabricator history tool above to diff baseline against #97908, and see that the only change I made was continue->return, due to changing it from a loop to a lambda (now a separate function).

This is why I pubished D32708 separately - I wanted to separate the functional change from the structural change.

emaste added a subscriber: emaste.May 18 2017, 7:56 PM

Without tcmalloc, on Ubuntu 14.04, 40 core VM: 13%
With tcmalloc, on Ubuntu 14.04, 40 core VM: 24% (built using cmake ... -DCMAKE_EXE_LINKER_FLAGS=-ltcmalloc_minimal, which amazingly only works when building with clang, not gcc...)

Do you have a brief set of steps you use for benchmarking? I'd like to compare on FreeBSD using a similar test.

scott.smith marked an inline comment as done.May 22 2017, 12:16 PM

Without tcmalloc, on Ubuntu 14.04, 40 core VM: 13%
With tcmalloc, on Ubuntu 14.04, 40 core VM: 24% (built using cmake ... -DCMAKE_EXE_LINKER_FLAGS=-ltcmalloc_minimal, which amazingly only works when building with clang, not gcc...)

Do you have a brief set of steps you use for benchmarking? I'd like to compare on FreeBSD using a similar test.

time lldb -b -o 'b main' -o 'run' /my/program
(sometimes I use 'perf stat' instead of time; I don't know if FreeBSD has something similar to Linux's perf - basically instruction count, cycle count, branch counts and mispredict rate, etc.)

my program happens to have a lot of symbols and a lot of libraries. I tried this benchmark with lldb itself, and all but one of my changes have no effect because lldb only links in one large library, so YMMV depending on the application.