This is an archive of the discontinued LLVM Phabricator instance.


Differential D13662

Make dwarf parsing multi-threaded
ClosedPublic

Authored by tberghammer on Oct 12 2015, 10:03 AM.


Details

Reviewers

clayborg
labath

Commits

rG2ff8870b6f49: Re-commit "Make dwarf parsing multi-threaded"
rGf84916d40fb2: Make dwarf parsing multi-threaded
rLLDB251106: Re-commit "Make dwarf parsing multi-threaded"
rLLDB250821: Make dwarf parsing multi-threaded
rL251106: Re-commit "Make dwarf parsing multi-threaded"
rL250821: Make dwarf parsing multi-threaded

Summary

Make dwarf parsing multi-threaded

Loading the debug info from a large application is the slowest task
LLDB does. This CL makes most of the DWARF parsing code multi-threaded.

As a result the speed of "attach; backtrace; exit;" when the inferior
is an LLDB with full debug info increased by a factor of 2 (on my machine).

Diff Detail

Event Timeline

tberghammer updated this revision to Diff 37129. Oct 12 2015, 10:03 AM

tberghammer retitled this revision from to Make dwarf parsing multi-threaded.

tberghammer updated this object.

tberghammer added reviewers: labath, clayborg.

tberghammer added a subscriber: lldb-commits.

If you have 1000 compile units, will this spawn 1000 threads simultaneously?

It depends on the implementation of std::async, which AFAIK isn't defined by the standard, but I would expect a decent STL implementation to create a reasonable number of threads (in some sense).

While developing/testing the code (with ~3000 CUs in a SymbolFile) I saw that the number of running threads (not blocked on a mutex) was around the number of cores I have (on Linux x86_64 with libstdc++). On a 40-core machine it was ~60 threads after fixing the mutex in ConstString (D13652) and ~500 before. Considering that thread creation isn't expensive on Linux, we don't have to worry about too many threads if they are blocked anyway (thread creation wasn't significant in the profiling output).

I can create a manual thread pool (or write a general thread pool class), but I think we can rely on the standard library until it is proven that it isn't working as we expect.

zturner added a subscriber: zturner. Oct 12 2015, 1:15 PM

zturner added inline comments.

source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2083	Every one of these is locking the same mutex. You could make arrays outside of the async work that is `num_compile_units` entries, and put each result in its own entry in the array. After the wait, you could make more async workers. One for each variable. like `m_function_base_name_index.Append` could be one async job, and the same for each of the other ones.
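
The per-entry layout suggested in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `NameList`, `ParseCompileUnit` and `IndexAllCompileUnits` are hypothetical stand-ins, and plain `std::async` with `std::launch::async` is used in place of whatever scheduling the final patch relies on. Each worker writes only to its own slot, so the workers need no shared lock, and the merge happens after all of them have finished.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one per-compile-unit index (not LLDB's NameToDIE).
using NameList = std::vector<std::string>;

// Hypothetical stand-in for parsing one compile unit.
NameList ParseCompileUnit(uint32_t cu_idx) {
    return {"cu" + std::to_string(cu_idx) + "::main",
            "cu" + std::to_string(cu_idx) + "::helper"};
}

NameList IndexAllCompileUnits(uint32_t num_compile_units) {
    // One slot per compile unit, allocated before any worker starts.
    std::vector<NameList> per_cu(num_compile_units);

    std::vector<std::future<void>> pending;
    pending.reserve(num_compile_units);
    for (uint32_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx) {
        pending.push_back(std::async(std::launch::async, [&per_cu, cu_idx] {
            // Each worker touches only its own slot, so no mutex is needed.
            per_cu[cu_idx] = ParseCompileUnit(cu_idx);
        }));
    }
    for (auto &f : pending)
        f.get();  // wait for completion and propagate any exception

    // Merge on the calling thread once every worker is done.
    NameList merged;
    for (const NameList &names : per_cu)
        merged.insert(merged.end(), names.begin(), names.end());
    assert(merged.size() == 2u * num_compile_units);
    return merged;
}
```

Note that `std::launch::async` starts one thread per task, which is exactly the thread-count concern raised earlier in this thread; the sketch only illustrates the lock-free result layout.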

std::async is fine as long as it doesn't blow out the threads on any supported systems. We should also test doing multiple std::async calls in different places in some test-only code, to make sure that if we run 4 std::async calls at once on different threads we don't end up launching 4 times as many threads. As long as std::async is limiting the number of threads globally within a process, we are good to go; otherwise we should be sure to implement the limit using a thread pool. I sent you some code on the side that might take care of that if std::async isn't good enough. Otherwise let's stick to the C++11 library when/where it is sufficient.

FWIW, I know for a fact that on Windows the number of threads is limited. So you're good to go here; I can't speak for other platforms.

You can probably limit the number of threads portably by using a semaphore that blocks after it's been acquired std::thread::hardware_concurrency() times.
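
A rough sketch of that semaphore idea, using only C++11 primitives (the standard library had no semaphore type at the time). Everything here is an assumption made for illustration: `Semaphore`, `RunLimited` and `DefaultLimit` are hypothetical names, and the "work" is just a yield. The spawning thread acquires a slot before starting each job, so at most `limit` jobs are in flight at once.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(unsigned count) : m_count(count) {}
    void Acquire() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_count > 0; });
        --m_count;
    }
    void Release() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_count;
        }
        m_cv.notify_one();
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    unsigned m_count;
};

// hardware_concurrency() may return 0 when the value is unknown.
unsigned DefaultLimit() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Runs num_jobs jobs with at most `limit` in flight at once and returns the
// peak number of jobs that were observed running simultaneously.
unsigned RunLimited(unsigned num_jobs, unsigned limit) {
    Semaphore slots(limit);
    std::atomic<unsigned> running{0};
    std::atomic<unsigned> peak{0};
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_jobs; ++i) {
        slots.Acquire();  // blocks the spawner once `limit` jobs are in flight
        threads.emplace_back([&] {
            unsigned now = ++running;
            unsigned prev = peak.load();
            while (prev < now && !peak.compare_exchange_weak(prev, now)) {
            }
            std::this_thread::yield();  // stand-in for real work
            --running;
            slots.Release();
        });
    }
    for (auto &t : threads)
        t.join();
    return peak.load();
}
```

A caller would typically pass `DefaultLimit()` as the limit. The sketch still creates one thread per job over its lifetime; it bounds only how many run at the same time, which is a different trade-off from a pool of reusable worker threads.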

Use the new ThreadPool class and make the Append+Finalize stage parallel.

Herald added a subscriber: iancottrell. Oct 14 2015, 7:34 AM

Missing the TaskPool.h and TaskPool.cpp files?

This revision now requires changes to proceed. Oct 14 2015, 10:30 AM

Please see D13727

Just saw that patch, so this looks good then pending the other patch.

This revision is now accepted and ready to land. Oct 14 2015, 11:28 AM

BTW: if we can modify clang to produce the Apple accelerator tables, we won't need to do any of this indexing which will really speed up debugging! We only produce the Apple accelerator tables on Darwin, but we could on other systems. There is also a new version of the accelerator tables that is going to be in DWARF 5 that is a modified version of our Apple accelerator tables. The Apple accelerator tables are actual accelerator tables that can be mmap'ed in and used as is. All other DWARF accelerator tables are actually not accelerator tables, they are randomly ordered tables that need to be sorted and ingested and often don't contain the correct things that a debugger wants. Like ".debug_pubtypes" will only mention "public" types. Any private types are not in the table. So the table is useless. Same goes for "debug_pubnames": only "public" names... Useless. So our new accelerator tables actually have all of the data in a format that can be used as is, no extra sorting required. They really speed up debugging and stop us from having to index the DWARF manually.

Closed by commit rL250821: Make dwarf parsing multi-threaded (authored by tberghammer). Oct 20 2015, 5:44 AM

This revision was automatically updated to reflect the committed changes.

clayborg added inline comments. Oct 20 2015, 10:21 AM

lldb/trunk/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2087–2088 ↗	(On Diff #37866)	So we are still going to serially wait for each item in the task list to complete? Don't we want to use TaskRunner::WaitForNextCompletedTask() here?

I reverted this change, as it caused some race condition, but see my comment inline.

lldb/trunk/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2087–2088 ↗	(On Diff #37866)	I don't see any benefit in using TaskRunner::WaitForNextCompletedTask() here because we can't really do anything when only a few tasks are completed, and using TaskRunner adds an extra layer of indirection (to implement WaitForNextCompletedTask), which has a very minor performance hit. One possible improvement is to merge the indexes on the main thread while we are waiting for the parsing tasks to complete, but I am not sure it would have any performance benefit, as it would mean doing it on a single thread instead of the 9 threads we use now (with TaskPool::RunTasks).
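
For illustration, running several independent finalize jobs concurrently and waiting for all of them might look like the sketch below. It is not LLDB's TaskPool: `Index` and `FinalizeAll` are hypothetical stand-ins, `Finalize()` is assumed to be a sort, and `std::async` replaces `TaskPool::RunTasks`.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one index; Finalize() here only sorts the names.
struct Index {
    std::vector<std::string> names;
    void Finalize() { std::sort(names.begin(), names.end()); }
};

// Finalizes every index on its own thread and returns once all are done.
void FinalizeAll(std::vector<Index> &indexes) {
    std::vector<std::future<void>> jobs;
    jobs.reserve(indexes.size());
    for (Index &index : indexes) {
        // The indexes are independent, so the jobs share no mutable state.
        jobs.push_back(std::async(std::launch::async,
                                  [&index] { index.Finalize(); }));
    }
    for (auto &job : jobs)
        job.get();  // wait and propagate any exception
}
```

Because each job touches a different index, the jobs need no locking; the only synchronization point is the final wait.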

See inlined comments.

lldb/trunk/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2087–2088 ↗	(On Diff #37866)	Seems like you could change the future to just return the cu_idx from parser_fn so we can append all items to the member variables in the main thread: while (uint32_t cu_idx : task_runner.WaitForNextCompletedTask()) { m_function_basename_index.Append(function_basename_index[cu_idx]); m_function_fullname_index.Append(function_fullname_index[cu_idx]); ... } Otherwise you are serializing the merging + finalize to be at the end. One nice thing about the way you are doing it is that it will be consistent from run to run, as the data will always appear in the same order as in the debug info file. But these maps should be the same regardless, and the order in which the data comes in shouldn't affect the final content.
2090–2095 ↗	(On Diff #37866)	If you do the Append() calls inside the while loop above, then all we need to do is call Finalize() on each member variable below.
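
The merge-as-tasks-complete idea from these comments can be sketched in a self-contained way as below. This is only an approximation under stated assumptions: `CompletionQueue` is a hypothetical stand-in for what `TaskRunner::WaitForNextCompletedTask()` provides, the per-CU "parsing" just produces two numbers, and the final sort plays the role of `Finalize()`.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical queue of finished task ids (not LLDB's TaskRunner).
class CompletionQueue {
public:
    void Push(uint32_t id) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_done.push(id);
        }
        m_cv.notify_one();
    }
    // Blocks until some task has finished, then returns its id.
    uint32_t Pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_done.empty(); });
        uint32_t id = m_done.front();
        m_done.pop();
        return id;
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<uint32_t> m_done;
};

std::vector<uint32_t> IndexAndMerge(uint32_t num_compile_units) {
    std::vector<std::vector<uint32_t>> per_cu(num_compile_units);
    CompletionQueue done;
    std::vector<std::thread> workers;
    for (uint32_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx) {
        workers.emplace_back([&per_cu, &done, cu_idx] {
            per_cu[cu_idx] = {cu_idx * 10, cu_idx * 10 + 1};  // stand-in parse
            done.Push(cu_idx);  // publish only after the slot is fully written
        });
    }

    // Merge on the main thread in completion order, while other workers
    // may still be running.
    std::vector<uint32_t> merged;
    for (uint32_t i = 0; i < num_compile_units; ++i) {
        uint32_t cu_idx = done.Pop();
        merged.insert(merged.end(), per_cu[cu_idx].begin(),
                      per_cu[cu_idx].end());
    }
    for (auto &w : workers)
        w.join();

    // Stand-in for Finalize(): sorting makes the result independent of the
    // order in which the tasks happened to complete.
    std::sort(merged.begin(), merged.end());
    return merged;
}
```

The sort at the end is what makes the completion order irrelevant to the final content, which matches the point made in the comment about the maps being the same regardless of arrival order.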

I tried out the implementation you suggested and made some measurements. The difference between the two implementations when attaching to LLDB is negligible (the total time spent in SymbolFileDWARF::Index differed by ~1%). The interesting part is that your implementation is faster for parsing C++ libraries (e.g. liblldb, libstdc++) while my implementation is faster for parsing C libraries (libc, libm, libdl), and I don't understand why that happens.

With the current measurements in place I don't feel strongly about either version, so if somebody has a strong preference then please let me know.

I plan to recommit this change after committing D13940, as that one fixes the remaining race conditions related to this change that I have found so far with TSAN.

I would venture to say we should optimize for C++ since those libraries tend to be larger, but I will leave the decision to you.

I decided to go with your approach, primarily because I tried it out with a lower number of threads and it performed marginally better (~10%) in that case.

Revision Contents

Path	Size
source/Plugins/SymbolFile/DWARF/NameToDIE.h	3 lines
source/Plugins/SymbolFile/DWARF/NameToDIE.cpp	11 lines
source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp	86 lines

Diff 38128

source/Plugins/SymbolFile/DWARF/NameToDIE.h

public:

void		void
Dump (lldb_private::Stream *s);		Dump (lldb_private::Stream *s);

void		void
Insert (const lldb_private::ConstString& name, const DIERef& die_ref);		Insert (const lldb_private::ConstString& name, const DIERef& die_ref);

void		void
		Append (const NameToDIE& other);

		void
Finalize();		Finalize();

size_t		size_t
Find (const lldb_private::ConstString &name, DIEArray &info_array) const;		Find (const lldb_private::ConstString &name, DIEArray &info_array) const;

size_t		size_t
Find (const lldb_private::RegularExpression& regex, DIEArray &info_array) const;		Find (const lldb_private::RegularExpression& regex, DIEArray &info_array) const;


source/Plugins/SymbolFile/DWARF/NameToDIE.cpp

	{			{
	const uint32_t size = m_map.GetSize();			const uint32_t size = m_map.GetSize();
	for (uint32_t i=0; i<size; ++i)			for (uint32_t i=0; i<size; ++i)
	{			{
	if (!callback(m_map.GetCStringAtIndexUnchecked(i), m_map.GetValueAtIndexUnchecked (i)))			if (!callback(m_map.GetCStringAtIndexUnchecked(i), m_map.GetValueAtIndexUnchecked (i)))
	break;			break;
	}			}
	}			}

				void
				NameToDIE::Append (const NameToDIE& other)
				{
				const uint32_t size = other.m_map.GetSize();
				for (uint32_t i = 0; i < size; ++i)
				{
				m_map.Append(other.m_map.GetCStringAtIndexUnchecked (i),
				other.m_map.GetValueAtIndexUnchecked (i));
				}
				}

source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp

#include "lldb/Symbol/VariableList.h"		#include "lldb/Symbol/VariableList.h"
#include "lldb/Symbol/TypeMap.h"		#include "lldb/Symbol/TypeMap.h"

#include "Plugins/Language/CPlusPlus/CPlusPlusLanguage.h"		#include "Plugins/Language/CPlusPlus/CPlusPlusLanguage.h"
#include "Plugins/Language/ObjC/ObjCLanguage.h"		#include "Plugins/Language/ObjC/ObjCLanguage.h"

#include "lldb/Target/Language.h"		#include "lldb/Target/Language.h"

		#include "lldb/Utility/TaskPool.h"

#include "DWARFASTParser.h"		#include "DWARFASTParser.h"
#include "DWARFCompileUnit.h"		#include "DWARFCompileUnit.h"
#include "DWARFDebugAbbrev.h"		#include "DWARFDebugAbbrev.h"
#include "DWARFDebugAranges.h"		#include "DWARFDebugAranges.h"
#include "DWARFDebugInfo.h"		#include "DWARFDebugInfo.h"
#include "DWARFDebugLine.h"		#include "DWARFDebugLine.h"
#include "DWARFDebugPubnames.h"		#include "DWARFDebugPubnames.h"
#include "DWARFDebugRanges.h"		#include "DWARFDebugRanges.h"
SymbolFileDWARF::Index ()
m_indexed = true;		m_indexed = true;
Timer scoped_timer (__PRETTY_FUNCTION__,		Timer scoped_timer (__PRETTY_FUNCTION__,
"SymbolFileDWARF::Index (%s)",		"SymbolFileDWARF::Index (%s)",
GetObjectFile()->GetFileSpec().GetFilename().AsCString("<Unknown>"));		GetObjectFile()->GetFileSpec().GetFilename().AsCString("<Unknown>"));

DWARFDebugInfo* debug_info = DebugInfo();		DWARFDebugInfo* debug_info = DebugInfo();
if (debug_info)		if (debug_info)
{		{
uint32_t cu_idx = 0;
const uint32_t num_compile_units = GetNumCompileUnits();		const uint32_t num_compile_units = GetNumCompileUnits();
for (cu_idx = 0; cu_idx < num_compile_units; ++cu_idx)		std::vector<NameToDIE> function_basename_index(num_compile_units);
		std::vector<NameToDIE> function_fullname_index(num_compile_units);
		std::vector<NameToDIE> function_method_index(num_compile_units);
		std::vector<NameToDIE> function_selector_index(num_compile_units);
		std::vector<NameToDIE> objc_class_selectors_index(num_compile_units);
		std::vector<NameToDIE> global_index(num_compile_units);
		std::vector<NameToDIE> type_index(num_compile_units);
		std::vector<NameToDIE> namespace_index(num_compile_units);

		auto parser_fn = [this,
		debug_info,
		&function_basename_index,
		&function_fullname_index,
		&function_method_index,
		&function_selector_index,
		&objc_class_selectors_index,
		&global_index,
		&type_index,
		&namespace_index](uint32_t cu_idx)
{		{
DWARFCompileUnit* dwarf_cu = debug_info->GetCompileUnitAtIndex(cu_idx);		DWARFCompileUnit* dwarf_cu = debug_info->GetCompileUnitAtIndex(cu_idx);

bool clear_dies = dwarf_cu->ExtractDIEsIfNeeded (false) > 1;		bool clear_dies = dwarf_cu->ExtractDIEsIfNeeded(false) > 1;

dwarf_cu->Index (m_function_basename_index,		dwarf_cu->Index(function_basename_index[cu_idx],
m_function_fullname_index,		function_fullname_index[cu_idx],
m_function_method_index,		function_method_index[cu_idx],
m_function_selector_index,		function_selector_index[cu_idx],
m_objc_class_selectors_index,		objc_class_selectors_index[cu_idx],
m_global_index,		global_index[cu_idx],
m_type_index,		type_index[cu_idx],
m_namespace_index);		namespace_index[cu_idx]);

// Keep memory down by clearing DIEs if this generate function		// Keep memory down by clearing DIEs if this generate function
// caused them to be parsed		// caused them to be parsed
if (clear_dies)		if (clear_dies)
dwarf_cu->ClearDIEs (true);		dwarf_cu->ClearDIEs(true);
}

m_function_basename_index.Finalize();		return cu_idx;
m_function_fullname_index.Finalize();		};
m_function_method_index.Finalize();
m_function_selector_index.Finalize();		TaskRunner<uint32_t> task_runner;
m_objc_class_selectors_index.Finalize();		for (uint32_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx)
m_global_index.Finalize();		task_runner.AddTask(parser_fn, cu_idx);
m_type_index.Finalize();
m_namespace_index.Finalize();		while (true)
		{
		std::future<uint32_t> f = task_runner.WaitForNextCompletedTask();
		if (!f.valid())
		break;
		uint32_t cu_idx = f.get();

		m_function_basename_index.Append(function_basename_index[cu_idx]);
		m_function_fullname_index.Append(function_fullname_index[cu_idx]);
		m_function_method_index.Append(function_method_index[cu_idx]);
		m_function_selector_index.Append(function_selector_index[cu_idx]);
		m_objc_class_selectors_index.Append(objc_class_selectors_index[cu_idx]);
		m_global_index.Append(global_index[cu_idx]);
		m_type_index.Append(type_index[cu_idx]);
		m_namespace_index.Append(namespace_index[cu_idx]);
		}

		TaskPool::RunTasks(
		[&]() { m_function_basename_index.Finalize(); },
		[&]() { m_function_fullname_index.Finalize(); },
		[&]() { m_function_method_index.Finalize(); },
		[&]() { m_function_selector_index.Finalize(); },
		[&]() { m_objc_class_selectors_index.Finalize(); },
		[&]() { m_global_index.Finalize(); },
		[&]() { m_type_index.Finalize(); },
		[&]() { m_namespace_index.Finalize(); });

#if defined (ENABLE_DEBUG_PRINTF)		#if defined (ENABLE_DEBUG_PRINTF)
StreamFile s(stdout, false);		StreamFile s(stdout, false);
s.Printf ("DWARF index for '%s':",		s.Printf ("DWARF index for '%s':",
GetObjectFile()->GetFileSpec().GetPath().c_str());		GetObjectFile()->GetFileSpec().GetPath().c_str());
s.Printf("\nFunction basenames:\n"); m_function_basename_index.Dump (&s);		s.Printf("\nFunction basenames:\n"); m_function_basename_index.Dump (&s);
s.Printf("\nFunction fullnames:\n"); m_function_fullname_index.Dump (&s);		s.Printf("\nFunction fullnames:\n"); m_function_fullname_index.Dump (&s);
s.Printf("\nFunction methods:\n"); m_function_method_index.Dump (&s);		s.Printf("\nFunction methods:\n"); m_function_method_index.Dump (&s);