Use TaskMapOverInt to demangle the symbols in parallel. Defer categorization of C++ symbols until later, when it can be determined what contexts are definitely classes, and what might not be.
Diff Detail
- Repository
- rL LLVM
Event Timeline
source/Symbol/Symtab.cpp | ||
---|---|---|
257 | The function looks big enough to deserve a name. (== Please move the lambda out of line) |
source/Symbol/Symtab.cpp | ||
---|---|---|
257 | ok but now's your chance to review it as a diff rather than a sea of red and green.... |
sea of red and green
Be explicit about what the lambda has access to (and make the inputs const) to prevent concurrency bugs.
What are the measured improvements here? We can't slow things down on any platforms. I know MacOS didn't respond well to making demangling run in parallel. I want to see some numbers here. And please don't quote numbers with tcmalloc or any special allocator unless you have a patch in LLDB already to make this a build option.
source/Symbol/Symtab.cpp | ||
---|---|---|
233–239 | Not being able to search for symbols by name when they are trampolines? If you lookup symbols by name I would expect things not to fail and I would expect that I would get all the answers, not just ones that are omitted for performance reasons. I would also not expect to have to specify extra flags. Please remove |
Without tcmalloc, on Ubuntu 14.04, 40 core VM: 13%
With tcmalloc, on Ubuntu 14.04, 40 core VM: 24% (built using cmake ... -DCMAKE_EXE_LINKER_FLAGS=-ltcmalloc_minimal, which amazingly only works when building with clang, not gcc...)
I don't have access to a Mac, and of course YMMV depending on the application.
so yeah, it's a bigger improvement with tcmalloc. Interestingly, I tried adding back the demangler improvements I had queued up (which reduced memory allocations), and they didn't matter much, which makes me think this is contention allocating const strings. I could be wrong though.
By far the biggest performance improvement I have queued up is loading the shared libraries in parallel (D32597), but that's waiting on pulling parallel_for_each from LLD into LLVM (I think).
If you're really leery of this change, I could just make the structural changes to allow parallelization, and then keep a small patch internally to enable it. Or enabling it could be platform dependent. Or ... ?
source/Symbol/Symtab.cpp | ||
---|---|---|
233–239 | This is just moved code, not new code. You can use the phabricator history tool above to diff baseline against #97908, and see that the only change I made was continue->return, due to changing it from a loop to a lambda (now a separate function). This is why I pubished D32708 separately - I wanted to separate the functional change from the structural change. |
Without tcmalloc, on Ubuntu 14.04, 40 core VM: 13%
With tcmalloc, on Ubuntu 14.04, 40 core VM: 24% (built using cmake ... -DCMAKE_EXE_LINKER_FLAGS=-ltcmalloc_minimal, which amazingly only works when building with clang, not gcc...)
Do you have a brief set of steps you use for benchmarking? I'd like to compare on FreeBSD using a similar test.
time lldb -b -o 'b main' -o 'run' /my/program
(sometimes I use 'perf stat' instead of time; I don't know if FreeBSD has something similar to Linux's perf - basically instruction count, cycle count, branch counts and mispredict rate, etc.)
my program happens to have a lot of symbols and a lot of libraries. I tried this benchmark with lldb itself, and all but one of my changes have no effect because lldb only links in one large library, so YMMV depending on the application.
Not being able to search for symbols by name when they are trampolines? If you lookup symbols by name I would expect things not to fail and I would expect that I would get all the answers, not just ones that are omitted for performance reasons. I would also not expect to have to specify extra flags. Please remove