This is an archive of the discontinued LLVM Phabricator instance.

Make dwarf parsing multi-threaded
ClosedPublic

Authored by tberghammer on Oct 12 2015, 10:03 AM.

Details

Summary

Make dwarf parsing multi-threaded

Loading the debug info from a large application is the slowest task
LLDB does. This CL makes most of the DWARF parsing code multi-threaded.

As a result, the speed of "attach; backtrace; exit;" when the inferior
is an LLDB with full debug info increased by a factor of 2 (on my machine).

Diff Detail

Repository
rL LLVM

Event Timeline

tberghammer retitled this revision from to Make dwarf parsing multi-threaded.
tberghammer updated this object.
tberghammer added reviewers: labath, clayborg.
tberghammer added a subscriber: lldb-commits.
clayborg edited edge metadata. Oct 12 2015, 11:34 AM

If you have 1000 compile units, will this spawn 1000 threads simultaneously?

It depends on the implementation of std::async, which AFAIK isn't defined by the standard, but I would expect a decent STL implementation to create a reasonable number of threads (in some sense).

While developing/testing the code (with ~3000 CUs in a SymbolFile) I saw that the number of running threads (not blocked on a mutex) was around the number of cores I have (on Linux x86_64 with libstdc++). It was ~60 threads after fixing the mutex in ConstString (D13652) and ~500 before, on a 40-core machine, but considering that thread creation isn't expensive on Linux we don't have to worry too much about threads that are blocked anyway (thread creation wasn't significant in the profiling output).

I can create a manual thread pool (or write a general thread pool class), but I think we can rely on the standard library until it is proven that it isn't working as we expect.
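
For illustration, here is a minimal, self-contained sketch of the fan-out pattern under discussion; ParseAndIndexCompileUnit and IndexAllCompileUnitsWithAsync are placeholder names, not functions from the patch:

#include <cstddef>
#include <future>
#include <vector>

// Placeholder for the real per-CU work: walk the DIEs of compile unit cu_idx
// and append its entries to the shared name indexes, whose Append() locks a
// mutex (see the inline comments below).
static void ParseAndIndexCompileUnit(size_t cu_idx) { (void)cu_idx; }

static void IndexAllCompileUnitsWithAsync(size_t num_compile_units) {
  std::vector<std::future<void>> futures;
  futures.reserve(num_compile_units);
  for (size_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx)
    futures.push_back(std::async(
        std::launch::async, [cu_idx] { ParseAndIndexCompileUnit(cu_idx); }));
  // How many of these actually run on distinct OS threads at once is up to
  // the std::async implementation; the standard does not specify it.
  for (auto &f : futures)
    f.wait();
}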

zturner added inline comments.
source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2074 ↗(On Diff #37129)

Every one of these is locking the same mutex. You could make arrays of num_compile_units entries outside of the async work, and have each task put its result in its own entry in the array.

After the wait, you could make more async workers, one for each variable. For example, the m_function_basename_index.Append calls could be one async job, and the same for each of the other indexes.
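
As a rough sketch of that suggestion (all names below are illustrative stand-ins, not the actual types and members from the patch): each task fills its own per-CU slot with no shared mutex, and after the wait each member index gets its own async merge job:

#include <cstddef>
#include <future>
#include <vector>

// Illustrative stand-in for the real NameToDIE index type.
struct NameToDIE {
  void Append(const NameToDIE &other) { (void)other; /* merge entries */ }
  void Finalize() { /* sort the merged entries */ }
};

// Stand-in for parsing one CU: it only touches this CU's slots, so no mutex
// is needed while the parsing tasks run.
static void IndexCompileUnit(size_t cu_idx, NameToDIE &basename_index,
                             NameToDIE &fullname_index) {
  (void)cu_idx; (void)basename_index; (void)fullname_index;
}

static void IndexAllCompileUnits(size_t num_compile_units,
                                 NameToDIE &m_function_basename_index,
                                 NameToDIE &m_function_fullname_index) {
  std::vector<NameToDIE> basename_index(num_compile_units);
  std::vector<NameToDIE> fullname_index(num_compile_units);

  std::vector<std::future<void>> futures;
  for (size_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx)
    futures.push_back(std::async(std::launch::async, [&, cu_idx] {
      IndexCompileUnit(cu_idx, basename_index[cu_idx], fullname_index[cu_idx]);
    }));
  for (auto &f : futures)
    f.wait();

  // After the wait, the merge of each member index can be its own async job.
  auto merge_basename = std::async(std::launch::async, [&] {
    for (auto &idx : basename_index)
      m_function_basename_index.Append(idx);
    m_function_basename_index.Finalize();
  });
  auto merge_fullname = std::async(std::launch::async, [&] {
    for (auto &idx : fullname_index)
      m_function_fullname_index.Append(idx);
    m_function_fullname_index.Finalize();
  });
  merge_basename.wait();
  merge_fullname.wait();
}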

std::async is fine as long as it doesn't blow out the threads on any supported system. We should also test doing multiple std::async calls in different places in some test-only code to make sure that if we run 4 std::async calls at once on different threads we don't end up launching 4 times as many threads. So as long as std::async limits the number of threads globally within a process, we are good to go; otherwise we should be sure to implement the limit using a thread pool. I sent you some code on the side that might take care of that if std::async isn't good enough. Otherwise let's stick to the C++11 library when/where it is sufficient.
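
Purely as an illustration of the kind of test-only check being suggested (not code from this patch), something along these lines would show whether concurrent std::async call sites multiply the thread count: four threads each launch a batch of std::async jobs, and an atomic counter records the peak number of jobs running at the same time. If std::async limits threads globally, the peak should stay near std::thread::hardware_concurrency() instead of scaling with the number of launching threads.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

int main() {
  std::atomic<int> running(0);
  std::atomic<int> peak(0);

  auto job = [&] {
    int now = ++running;
    int prev = peak.load();
    while (now > prev && !peak.compare_exchange_weak(prev, now))
      ;
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // fake work
    --running;
  };

  auto launch_batch = [&] {
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 100; ++i)
      futures.push_back(std::async(std::launch::async, job));
    for (auto &f : futures)
      f.wait();
  };

  std::vector<std::thread> launchers;
  for (int i = 0; i < 4; ++i)
    launchers.emplace_back(launch_batch);
  for (auto &t : launchers)
    t.join();

  std::printf("peak concurrent jobs: %d, hardware threads: %u\n", peak.load(),
              std::thread::hardware_concurrency());
  return 0;
}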

FWIW, I know for a fact that on Windows the number of threads is limited. So you're good to go here; I can't speak for other platforms.

You can probably limit the number of threads portably by using a semaphore that blocks after it's been acquired std::thread::hardware_concurrency() times.
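
A minimal sketch of that semaphore idea (C++11-style, since std::counting_semaphore only arrived in C++20; all names below are illustrative and nothing is taken from the patch): the per-CU work is gated so that at most hardware_concurrency() tasks do real work at any one time, even if far more std::async tasks were launched.

#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

class Semaphore {
public:
  explicit Semaphore(unsigned count) : m_count(count) {}

  void Acquire() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [this] { return m_count > 0; });
    --m_count;
  }

  void Release() {
    std::lock_guard<std::mutex> lock(m_mutex);
    ++m_count;
    m_cv.notify_one();
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  unsigned m_count;
};

static void IndexAllCompileUnitsGated(size_t num_compile_units) {
  Semaphore slots(std::max(1u, std::thread::hardware_concurrency()));
  std::vector<std::future<void>> futures;
  for (size_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx)
    futures.push_back(std::async(std::launch::async, [&slots, cu_idx] {
      slots.Acquire();
      (void)cu_idx; // ... parse and index compile unit cu_idx here ...
      slots.Release();
    }));
  for (auto &f : futures)
    f.wait();
}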

tberghammer edited edge metadata.

Use the new ThreadPool class and make the Append+Finalize stage parallel.

clayborg requested changes to this revision. Oct 14 2015, 10:30 AM
clayborg edited edge metadata.

Missing the TaskPool.h and TaskPool.cpp files?

This revision now requires changes to proceed. Oct 14 2015, 10:30 AM
clayborg accepted this revision. Oct 14 2015, 11:28 AM
clayborg edited edge metadata.

Just saw that patch, so this looks good pending the other patch.

This revision is now accepted and ready to land. Oct 14 2015, 11:28 AM

BTW: if we can modify clang to produce the Apple accelerator tables, we won't need to do any of this indexing, which will really speed up debugging! We only produce the Apple accelerator tables on Darwin, but we could on other systems. There is also a new version of the accelerator tables that is going to be in DWARF 5, which is a modified version of our Apple accelerator tables. The Apple accelerator tables are actual accelerator tables that can be mmap'ed in and used as is. All other DWARF accelerator tables are actually not accelerator tables; they are randomly ordered tables that need to be sorted and ingested and often don't contain the correct things that a debugger wants. For example, ".debug_pubtypes" only mentions "public" types. Any private types are not in the table. So the table is useless. Same goes for ".debug_pubnames": only "public" names... Useless. So our new accelerator tables actually have all of the data in a format that can be used as is, no extra sorting required. They really speed up debugging and stop us from having to index the DWARF manually.

This revision was automatically updated to reflect the committed changes.
clayborg added inline comments. Oct 20 2015, 10:21 AM
lldb/trunk/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2087–2088

So we are still going to serially wait for each item in the task list to complete? Don't we want to use TaskRunner::WaitForNextCompletedTask() here?

I reverted this change, as it caused some race conditions, but see my comment inline.

lldb/trunk/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2087–2088

I don't see any benefit in using TaskRunner::WaitForNextCompletedTask() here, because we can't really do anything when only a few tasks are completed, and using TaskRunner adds an extra layer of indirection (to implement WaitForNextCompletedTask) which has a very minor performance hit.

One possible improvement we could make is to do the merging of the indexes on the main thread while we are waiting for the parsing tasks to complete, but I am not sure it would have any performance benefit, as it would mean doing the merge on a single thread instead of the 9 threads we use now (with TaskPool::RunTasks).
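
For reference, a rough, assumption-based sketch of that "run N merge tasks in parallel and wait" shape; the real TaskPool::RunTasks lives in the separate TaskPool patch, so the helper below is only a stand-in with an assumed interface:

#include <future>
#include <initializer_list>
#include <utility>
#include <vector>

// Stand-in for what TaskPool::RunTasks is described as doing in this thread:
// run every callable concurrently and block until all of them have finished.
template <typename... Tasks> void RunTasksAndWait(Tasks &&... tasks) {
  std::vector<std::future<void>> futures;
  (void)std::initializer_list<int>{
      (futures.push_back(
           std::async(std::launch::async, std::forward<Tasks>(tasks))),
       0)...};
  for (auto &f : futures)
    f.wait();
}

// Usage shape: one lambda per member index, each appending that index's
// per-CU partial results and then finalizing it.
// RunTasksAndWait([&] { /* merge + finalize m_function_basename_index */ },
//                 [&] { /* merge + finalize m_function_fullname_index */ },
//                 /* ... */);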

See inlined comments.

lldb/trunk/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2087–2088

Seems like you could change the future to just return the cu_idx from parser_fn so we can append all items to the member variables in the main thread:

for (size_t i = 0; i < num_compile_units; ++i)
{
     // Assuming WaitForNextCompletedTask() hands back the next finished
     // future, which (per the suggestion above) now carries the cu_idx
     // returned by parser_fn.
     uint32_t cu_idx = task_runner.WaitForNextCompletedTask().get();
     m_function_basename_index.Append(function_basename_index[cu_idx]);
     m_function_fullname_index.Append(function_fullname_index[cu_idx]);
     ...
}

Otherwise you are serializing the merging + finalize to be at the end. One nice thing about doing it the way you are doing it is that it will be consistent from run to run, as the data will always appear in the same order as in the debug info file. But these maps should be the same regardless, and the order in which the data comes in shouldn't affect the final content.

2090–2095

If you do the Append() calls inside the while loop above, then all we need to do is call Finalize() on each member variable below.

tberghammer edited edge metadata.
tberghammer removed rL LLVM as the repository for this revision.

I tried out the implementation you suggested and made some measurements. The difference between the two implementations when attaching to LLDB is negligible (the total time spent in SymbolFileDWARF::Index differed by ~1%). The interesting part is that your implementation is faster for parsing C++ libraries (e.g. liblldb, libstdc++) while my implementation is faster for parsing C libraries (libc, libm, libdl), and I don't understand why that is happening.

With the current measurements in place I don't feel strongly about either version, so if somebody has a strong preference then please let me know.

I plan to recommit this change after committing D13940, as that one fixes the remaining race conditions related to this change that I have found so far with TSAN.

I would venture to say we should optimize for C++ since those libraries tend to be larger, but I will leave the decision to you.

I decided to go with your approach, primarily because I tried it out with a lower number of threads and it performed marginally better (~10%) in that case.