This is an archive of the discontinued LLVM Phabricator instance.

Fix deadlock in dwarf logging
AbandonedPublic

Authored by fjricci on Jan 10 2018, 9:29 AM.

Details

Summary

When dwarf parse logging is enabled (i.e. "log enable dwarf info"),
deadlocks can occur during dwarf parsing:

Thread 1:
SymbolVendor::FindFunctions (acquires mutex for Module)
SymbolFileDWARF::Index
<spawn task pool for ExtractDIEsIfNeeded> (blocks on task pool completion)

Task pool threads:
ExtractDIEsIfNeeded
Module::LogMessageVerboseBacktrace
Module::GetDescription (tries to acquire mutex for Module and deadlocks)

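For illustration, here is a minimal standalone sketch (not LLDB code; names are only suggestive) of the pattern above: the indexing thread holds the module's recursive mutex and then blocks waiting on its workers, while each worker tries to take the same mutex just to format a log message.

#include <mutex>
#include <thread>

std::recursive_mutex module_mutex; // stands in for Module::m_mutex

void worker() {
  // Corresponds to a task-pool thread calling Module::GetDescription() from
  // the logging path: it blocks here forever, because the owner of the mutex
  // is waiting for this thread to finish.
  std::lock_guard<std::recursive_mutex> guard(module_mutex);
}

int main() {
  // Corresponds to SymbolVendor::FindFunctions() taking the module mutex.
  std::lock_guard<std::recursive_mutex> guard(module_mutex);
  std::thread t(worker); // the "task pool" spawned by SymbolFileDWARF::Index
  t.join();              // blocks on worker completion -> deadlock
}
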
Since GetDescription is read-only, only touches fairly immutable data
(architecture and filename), and is only used for logging,
the most straightforward fix is to remove the lock guard from this
function.
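
A minimal sketch of the shape of the fix (toy types, not the actual diff): the only change is dropping the lock guard, since the members read here are effectively immutable for the lifetime of the Module.

#include <cstdio>
#include <mutex>
#include <string>

class Module {
public:
  void GetDescription(std::FILE *s) {
    // std::lock_guard<std::recursive_mutex> guard(m_mutex);
    // ^ removed: task-pool threads call this while logging, and the indexing
    //   thread already holds m_mutex while waiting for them to finish.
    std::fprintf(s, "(%s) %s\n", m_arch.c_str(), m_file.c_str());
  }

private:
  std::recursive_mutex m_mutex;
  std::string m_arch = "x86_64"; // stands in for ArchSpec
  std::string m_file = "a.out";  // stands in for FileSpec
};

int main() { Module().GetDescription(stdout); }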

Event Timeline

fjricci created this revision. Jan 10 2018, 9:29 AM
fjricci edited the summary of this revision. Jan 10 2018, 9:31 AM

GetDescription might be read only, but the code that modifies the description isn't, right?

I guess the question is whether we expect that someone will do something like change the module's filepath while we're printing a log message with that filepath in it.

Actually, I don't think even that is racy, because we just read the const char * pointer, and the string it points to is immutable anyway.

It's definitely possible to re-design the lock holding in such a way that we can keep this locked, but I don't want to go through all the work to do that if there isn't any added value to doing so.

Tough call on this one. Yes, the function is dumping simple stuff, but it is using m_arch, m_file and m_objname. It could cause crashes in multi-threaded environments. Is the deadlock caused by an A/B lock issue?

I think it isn't an A/B locking issue. The problem is that two different threads (the main thread and one of the DWARF indexer threads) want to lock the mutex at the same time.

I have mixed feelings about this change, because introducing any sort of race condition (even one hit only during logging) is problematic: it can lead to undefined behavior and then to crashes, for example by reading and then dereferencing an incorrect const char*. In this specific case I am pretty sure the code is actually thread safe on all architectures we care about (unless the compiler does something very crazy), because reading a set of ConstStrings is equivalent to reading a single pointer, which I think is atomic on all architectures we care about (assuming it is aligned), and the data it points to lives in the ConstString data pool, which is read-only and thread safe. My concern is that changing something fairly trivial (e.g. changing a ConstString to a std::string inside FileSpec) could make this code not thread safe and introduce new crashes, so if possible I think we should try to keep the mutex here.
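
To make that reasoning concrete, a toy illustration (not the real ConstString) of what the unlocked reader actually touches: a single aligned pointer whose target lives in an append-only pool and is never modified or freed. The concern above is that this guarantee is an accident of the representation rather than a documented contract.

#include <cstdio>

struct ConstStringLike {
  // Points into a global, append-only string pool; the bytes behind it never
  // change and are never freed, so only the pointer itself is shared state.
  const char *m_string = "some/module/path";
};

int main() {
  ConstStringLike file_name;          // what Module::GetDescription reads
  const char *s = file_name.m_string; // a single aligned pointer load
  std::printf("%s\n", s);             // the data behind s is immutable
}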

Why do we need to lock the Module mutex in SymbolVendor::FindFunctions? I think a better option would be to remove that lock and, if it is needed, take it just around the calls where it is necessary. The fact that SymbolVendor locks a mutex inside a Module feels like a pretty bad layering violation to me, one that can cause many other deadlocks, so it would be nice to fix that instead of hacking around it here.

I think a better option would be to remove that lock and, if it is needed, take it just around the calls where it is necessary. The fact that SymbolVendor locks a mutex inside a Module feels like a pretty bad layering violation to me, one that can cause many other deadlocks, so it would be nice to fix that instead of hacking around it here.

Was thinking about that as well. Will look into it.

fjricci planned changes to this revision. Jan 11 2018, 7:42 AM

Why do we need to lock the Module mutex in SymbolVendor::FindFunctions? I think a better option would be to remove that lock and, if it is needed, take it just around the calls where it is necessary. The fact that SymbolVendor locks a mutex inside a Module feels like a pretty bad layering violation to me, one that can cause many other deadlocks, so it would be nice to fix that instead of hacking around it here.

Module + ObjectFile + SymbolVendor + SymbolFile all lock the module's mutex before doing anything that might add information to the Module. SymbolVendor::FindFunctions will lazily parse functions from the debug info and populate things inside the module, so the lock is required. You can't put individual locks on Module + ObjectFile + SymbolVendor + SymbolFile, as that _will_ cause deadlocks.
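
As a small aside, a purely illustrative sketch (hypothetical locks and call chains, not LLDB code) of why separate per-object locks invite classic lock-order inversions: one thread enters through the Module and then calls into the SymbolVendor, while another enters through the SymbolVendor and then needs to update the Module.

#include <mutex>
#include <thread>

std::recursive_mutex module_mutex;        // hypothetical: one lock per Module
std::recursive_mutex symbol_vendor_mutex; // hypothetical: one lock per SymbolVendor

void thread_a() {
  std::lock_guard<std::recursive_mutex> m(module_mutex);        // Module API entry
  std::lock_guard<std::recursive_mutex> v(symbol_vendor_mutex); // then lazy parse
}

void thread_b() {
  std::lock_guard<std::recursive_mutex> v(symbol_vendor_mutex); // SymbolVendor entry
  std::lock_guard<std::recursive_mutex> m(module_mutex);        // then update Module
}

int main() {
  std::thread a(thread_a), b(thread_b);
  a.join();
  b.join(); // may hang: each thread can hold one lock while waiting for the other
}

Sharing the single module mutex avoids this A/B ordering problem, at the cost of the coarse locking discussed in this thread.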

SymbolVendor::FindFunctions will lazily parse functions from the debug info and populate things inside the module, so the lock is required.

Can it give up the lock while it's blocked on worker threads? Holding a lock while blocked seems like a recipe for deadlocks (like this one).

We would need to pass the mutex down into any call that needs to let the mutex go. At least, that is the only way I could see this working... But then every function that does anything lazily would need this same treatment. That is a lot of trouble just for logging.
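
For completeness, a rough sketch (hypothetical signatures, not a proposal) of what "passing the mutex down" would look like, so the indexing code can drop the lock around its blocking wait. Note that this only works cleanly if the recursive mutex is held exactly once at that point, which is part of why it is a lot of plumbing.

#include <mutex>

// Hypothetical: the caller's lock is handed down so it can be released
// around the blocking wait on the worker threads.
void IndexDWARF(std::unique_lock<std::recursive_mutex> &module_lock) {
  // ... enqueue the ExtractDIEsIfNeeded tasks on the task pool ...
  module_lock.unlock(); // workers may now take the module mutex to log
  // task_pool.Wait();  // block on completion without holding the lock
  module_lock.lock();   // re-acquire before publishing results to the Module
}

int main() {
  std::recursive_mutex m; // stands in for Module::m_mutex
  std::unique_lock<std::recursive_mutex> lock(m); // e.g. SymbolVendor::FindFunctions
  IndexDWARF(lock); // released and re-acquired around the blocking wait
}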

zturner added a comment (edited). Jan 11 2018, 2:23 PM

Seems like the high-level problem is that the entity which acquires the mutex is not the same as the entity which decides how data gets into the Module. For example, we call SymbolVendor::FindFunctions and expect that someone is going to somehow get some functions into the Module, but we don't anticipate that that someone might decide to do the work in a multi-threaded fashion. I think as a general principle, when you have this kind of lazy access pattern, you need to clearly separate the responsibilities: the entity that acquires the lock needs to be the same entity that updates the state.

In other words, something like this:

// Not locked.  Does not modify the Module or anything else for that matter.
// It simply parses and returns the results, and lets you decide what you
// want to do with them.
auto Functions = m_sym_file_ap->FindFunctions(...);

// Now grab the lock and update the Module.
std::lock_guard<std::recursive_mutex> guard(module_sp->GetMutex());
module_sp->AddFunctions(Functions);

Yeah, the threading was added later. It started out as single threaded and didn't have this problem.

fjricci abandoned this revision. Mar 29 2018, 7:56 AM

I'm not going to get to this