This is an archive of the discontinued LLVM Phabricator instance.

[NFC] Refactor symbol table parsing.
ClosedPublic

Authored by clayborg on Nov 19 2021, 2:41 PM.

Details

Summary

Symbol table parsing has evolved over the years and many plug-ins contained duplicate code in the ObjectFile::GetSymtab() that used to be pure virtual. With this change, the "Symbtab *ObjectFile::GetSymtab()" is no longer virtual and will end up calling a new "void ObjectFile::ParseSymtab(Symtab &symtab)" pure virtual function to actually do the parsing. This helps centralize the code for parsing the symbol table and allows the ObjectFile base class to do all of the common work, like taking the necessary locks and creating the symbol table object itself. Plug-ins now just need to parse when they are asked to parse as the ParseSymtab function will only get called once.

This is a retry of the original patch https://reviews.llvm.org/D113965 which was reverted. There was a deadlock in the Manual DWARF indexing code during symbol preloading where the module was asked on the main thread to preload its symbols, and this would in turn cause the DWARF manual indexing to use a thread pool to index all of the compile units, and if there were relocations on the debug information sections, these threads could ask the ObjectFile to load section contents, which could cause a call to ObjectFileELF::RelocateSection() which would ask for the symbol table from the module and it would deadlock. We can't lock the module in ObjectFile::GetSymtab(), so the solution I am using is to use a llvm::once_flag to create the symbol table object once and then lock the Symtab object. Since all APIs on the symbol table use this lock, this will prevent anyone from using the symbol table before it is parsed and finalized and will avoid the deadlock I mentioned. ObjectFileELF::GetSymtab() was never locking the module lock before and would put off creating the symbol table until somewhere inside ObjectFileELF::GetSymtab(). Now we create it one time inside of the ObjectFile::GetSymtab() and immediately lock it which should be safe enough. This avoids the deadlocks and still provides safety.

Diff Detail

Event Timeline

clayborg created this revision.Nov 19 2021, 2:41 PM
clayborg requested review of this revision.Nov 19 2021, 2:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 19 2021, 2:41 PM

I don't believe this solution is correct.

How did this work before? Is it because ObjectFileELF::GetSymtab contained the same (incorrect) unique_ptr pattern?

How about we just prime the symtab (or whatever is needed) before we start the thread pool indexing?

lldb/source/Symbol/ObjectFile.cpp
723

This is not true. unique_ptr makes no guarantees about the safety of concurrent accesses to a single unique_ptr object. While this may work on current hardware (it will most likely work on x86), it will make tsan very unhappy.

I don't believe this solution is correct.

How did this work before? Is it because ObjectFileELF::GetSymtab contained the same (incorrect) unique_ptr pattern?

It did, but because each ObjectFile subclass did things there own way, it worked only because ObjectFileELF was never taking the module lock to begin with. All others object files did.

How about we just prime the symtab (or whatever is needed) before we start the thread pool indexing?

Doesn't work, I tried that. The problem is the way the .dwo file stuff currently works. With some of the tests, we are loading a .dwo file as the main module. In this cause we end up with a target module, then ends up opening up a ".dwo" module as a _separate_ object file in the DWO support in the DWARF classes (not the same ObjectFile that the target Module has, but a different one), but this extra object file is saying its module is the main module. I didn't make any of the DWO changes to our DWARF classes, so I am still coming up to speed with why this weird case where two different object files claim the same module is their owning module. It is probably because of the way the test was written. But that being said, we should be able to load the .dwo files as a main module to inspect what we can out of them. Do all .dwo files claim the main executable that references them is their owning module? If so, many other deadlocks can happen during indexing.

Regardless of this problem, the DWARF indexing has caused deadlocks before due to the module lock. Like the the Module::GetDescription(...) when we enabled DWARF logging. So even if we solve this one problem, we can run into it again the next time DWARF indexing causes some file to load/fetch something. For many of the things inside of the module that are populated once (section list, symbol table, etc), we will run into issues with any threading code do the module lock.

One way to possibly fix this is to stop the Module class from using the mutex when populating these "compute once and then hand out an object" and have them use a llvm::once_flag as a member variable to control access. We might need to pull the same trick I do with the ObjectFile where the llvm::once_flag is put into a std::unique_ptr<llvm::once_flag> so it can be cleared if the this object that has handed out can be cleared or changed (like the trick I used in ObjectFile::ClearSymtab()).

So the fix is not straight forward if we aren't comfortable with relying on the Symtab mutex to protect the symbol table that I proposed in this patch.

Do all .dwo files claim the main executable that references them is their owning module?

The short answer is "yes".

The reason for that is that the dwo files do not constitute a module in the normal sense (not even in the sense that is used by DebugMapModule in SymbolFileDWARFDebugMap). In contains some debug info, but only a part of it, and that part is meaningless without the information in the main file. Like, you can (syntactically) parse the DIE tree, but you won't be able to resolve any of the addresses contained therein. In an DebugMap module, the addresses will resolve to sections of the .o file, and then there's a separate fixup step (independent of dwarf) to remap them to the main file. OTOH, in a dwo file you just get nothing.

*However*, dwo files are still regular (elf) object files, and the object file component can be parsed independently of anything else. In this sense, it would be reasonable to create a separate module for them, to ensure their isolation. In fact, we currently need to jump through some hoops to ensure that the dwo object information does not leak into the main module (see the update_module_section_list argument to ObjectFile::GetSectionList). Having a separate Module object would remove the need for that.

Overall, in either case (a separate module or not) some part of the interface will end up being weird (a separate module is better for the object file class, a single module for the symbol file). It's possible the overall messiness would be reduced in the separate module version, but it's hard to say without trying to implement that.

However, maybe we don't figure that out now. See the inline comment.

lldb/source/Symbol/ObjectFile.cpp
723

That said, I think the fix could be as simple removing these two lines. call_once already has a fast path for already-initialized case, anyway, and going through the call_once function would ensure that a unique_ptr read cannot happen concurrently with any write to it.

clayborg updated this revision to Diff 390444.Nov 29 2021, 12:39 PM

Let the call_once protect the ObjectFile::m_symtab_up as suggested.

labath accepted this revision.Nov 30 2021, 7:47 AM
This revision is now accepted and ready to land.Nov 30 2021, 7:47 AM
This revision was automatically updated to reflect the committed changes.

I will check this out on linux. Any reason why I did not get a message to my email that this was failing?

I will check this out on linux. Any reason why I did not get a message to my email that this was failing?

I checked this out on normal linux and it passes correctly.

I modified the test file to dump more input context so we can see what is going on on the lldb-arm-ubuntu buildbots in the error output with:

commit 266a66c915cbbc36b1a3887963eb97f32306c7e4 (HEAD -> main, origin/main, origin/HEAD)
Author: Greg Clayton <gclayton@fb.com>
Date: Thu Dec 2 15:47:15 2021 -0800

Include extra input contents on this test so we can see why lldb-arm-ubuntu buildbot is failing.

Only lldb-arm-ubuntu is failing after https://reviews.llvm.org/D114288 and there isn't enough input context to see why this is failing. It works on x86_64 linux just fine.

Thanks for doing that! I was able to successfully run the test on linux, and then I added extra context to the stop. It seems that the two breakpoints are set correctly, but the second breakpoint is not hit and the program exits. So all is good from the symbol table changes as far as I have been able to tell since both breakpoints were set. Below is the extra output that shows the first breakpoint was hit, and then we continue and don't hit the second breakpoint, but we do exit with the correct exit status of 12, so the program isn't crashing. In the lines below we now can see after the "continue" command, that is just exits with the expect statu of 12 (which is argc == 1 and it multiplies it by 3 and then by 4). I am happy to help look at this. If you can send a copy of the arm "minidebuginfo-set-and-hit-breakpoint.test.tmp.binary" binary I might be able to see what is going on.

 1: (lldb) command source -s 0 '/home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/tools/lldb/test/Shell/lit-lldb-init' 
 2: Executing commands in '/home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/tools/lldb/test/Shell/lit-lldb-init'. 
 3: (lldb) # LLDB init file for the LIT tests. 
 4: (lldb) settings set symbols.enable-external-lookup false 
 5: (lldb) settings set plugin.process.gdb-remote.packet-timeout 60 
 6: (lldb) settings set interpreter.echo-comment-commands false 
 7: (lldb) settings set symbols.clang-modules-cache-path "/home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-lldb" 
 8: (lldb) settings set target.auto-apply-fixits false 
 9: (lldb) settings set target.inherit-tcc true 
10: (lldb) settings set target.detach-on-error false 
11: (lldb) target create "/home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/tools/lldb/test/Shell/ObjectFile/ELF/Output/minidebuginfo-set-and-hit-breakpoint.test.tmp.binary" 
12: Current executable set to '/home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/tools/lldb/test/Shell/ObjectFile/ELF/Output/minidebuginfo-set-and-hit-breakpoint.test.tmp.binary' (arm). 
13: (lldb) b multiplyByThree 
14: Breakpoint 1: where = minidebuginfo-set-and-hit-breakpoint.test.tmp.binary`multiplyByThree, address = 0x00010440 
15: (lldb) b multiplyByFour 
16: Breakpoint 2: where = minidebuginfo-set-and-hit-breakpoint.test.tmp.binary`multiplyByFour, address = 0x00010428 
17: (lldb) run 
18: Process 3530626 stopped 
19: * thread #1, name = 'minidebuginfo-s', stop reason = breakpoint 1.1

check:71'0 X error: no match found

20:  frame #0: 0x00010440 minidebuginfo-set-and-hit-breakpoint.test.tmp.binary`multiplyByThree

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:71'1 ? possible intended match

21: minidebuginfo-set-and-hit-breakpoint.test.tmp.binary`multiplyByThree:

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

22: -> 0x10440 <+0>: sub sp, sp, #4

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

23:  0x10444 <+4>: str r0, [sp]

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

24:  0x10448 <+8>: ldr r0, [sp]

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

25:  0x1044c <+12>: movw r1, #0x3

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

26: Process 3530626 launched: '/home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/tools/lldb/test/Shell/ObjectFile/ELF/Output/minidebuginfo-set-and-hit-breakpoint.test.tmp.binary' (arm)

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

27: (lldb) continue

check:71'0 ~~~~~~~~~~~~~~~~

28: Process 3530626 resuming

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~

29: Process 3530626 exited with status = 12 (0x0000000c)

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

30: (lldb) breakpoint list -v

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~

31: Current breakpoints:

check:71'0 ~~~~~~~~~~~~~~~~~~~~~

32: 1: name = 'multiplyByThree'

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

33:  1.1:

check:71'0 ~~~~~~~

34:  module = /home/tcwg-buildslave/worker/lldb-arm-ubuntu/build/tools/lldb/test/Shell/ObjectFile/ELF/Output/minidebuginfo-set-and-hit-breakpoint.test.tmp.binary

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

35:  symbol = multiplyByThree

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~

36:  address = 0x00010440

check:71'0 ~~~~~~~~~~~~~~~~~~~~~~

37:  resolved = true

check:71'0 ~~~~~~~~~~~~~~~~~

38:  hardware = false

check:71'0 ~~~~~~~~~~~~~~~~~~

39:  hit count = 1

check:71'0 ~~~~~~~~~~~~~~~~

40:

check:71'0 ~

kastiglione added inline comments.
lldb/source/Symbol/ObjectFile.cpp
739

During expression evaluation, I am hitting a deadlock where a reentrant call to ObjectFile::GetSymtab is happening with he same once_flag, causing it to block. This is macOS/arm64.

Here's a (simplified) stack:

frame #4: void llvm::call_once<lldb_private::ObjectFile::GetSymtab()::$_0>(flag=0x..., F=0x...)::$_0&&) at Threading.h:89
frame #5: lldb_private::ObjectFile::GetSymtab(this=0x...) at ObjectFile.cpp:758
frame #6: SymbolFileSymtab::CalculateAbilities(this=0x...) at SymbolFileSymtab.cpp:60
frame #7: lldb_private::SymbolFile::GetAbilities(this=0x...) at SymbolFile.h:105
frame #8: lldb_private::SymbolFile::FindPlugin(objfile_sp=std::shared_ptr<lldb_private::ObjectFile>::element_type @ 0x... strong=7 weak=2) at SymbolFile.cpp:72
frame #9: lldb_private::SymbolVendor::AddSymbolFileRepresentation(this=0x..., objfile_sp=std::shared_ptr<lldb_private::ObjectFile>::element_type @ 0x... strong=7 weak=2) at SymbolVendor.cpp:71
frame #10:`lldb_private::SymbolVendor::FindPlugin(module_sp=std::shared_ptr<lldb_private::Module>::element_type @ 0x... strong=12 weak=12, feedback_strm=0x...) at SymbolVendor.cpp:57
frame #11:`lldb_private::Module::GetSymbolFile(this=0x..., can_create=true, feedback_strm=0x...) at Module.cpp:1070
frame #12:`lldb_private::Module::FindFunctions(this=0x..., name="$s17_StringProcessing14AnyRegexOutputVMa", parent_decl_ctx=0x..., name_type_mask=eFunctionNameTypeFull, options=0x..., sc_list=0x...) at Module.cpp:830
frame #13:`lldb_private::ModuleList::FindFunctions(this=0x..., name="$s17_StringProcessing14AnyRegexOutputVMa", name_type_mask=eFunctionNameTypeFull, options=0x..., sc_list=0x...) const at ModuleList.cpp:512
frame #14:`lldb_private::IRExecutionUnit::FindInSymbols(this=0x..., names=size=2, sc=0x..., symbol_was_missing_weak=0x...) at IRExecutionUnit.cpp:841
frame #15:`lldb_private::IRExecutionUnit::FindSymbol(this=0x..., name="_$s17_StringProcessing14AnyRegexOutputVMa", missing_weak=0x...) at IRExecutionUnit.cpp:913
frame #16:`lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence(this=0x..., Name="_$s17_StringProcessing14AnyRegexOutputVMa", missing_weak=0x...) at IRExecutionUnit.cpp:1008
frame #17:`lldb_private::IRExecutionUnit::MemoryManager::findSymbol(this=0x..., Name="_$s17_StringProcessing14AnyRegexOutputVMa") at IRExecutionUnit.cpp:986
frame #18:`llvm::LinkingSymbolResolver::findSymbol(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 128
frame #19:`llvm::LegacyJITSymbolResolver::lookup(std::set<llvm::StringRef, std::less<llvm::StringRef>, std::allocator<llvm::StringRef> > const&, llvm::unique_function<void (llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >)>) + 308
frame #20:`llvm::RuntimeDyldImpl::resolveExternalSymbols() + 708
frame #21:`llvm::RuntimeDyldImpl::resolveRelocations() + 248
frame #22:`llvm::MCJIT::finalizeLoadedModules() + 56
frame #23:`llvm::MCJIT::getGlobalValueAddress(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 68
frame #24:`lldb_private::IRExecutionUnit::PopulateSymtab(this=0x..., obj_file=0x..., symtab=0x...) at IRExecutionUnit.cpp:1244
frame #25:`ObjectFileJIT::ParseSymtab(this=0x..., symtab=0x...) at ObjectFileJIT.cpp:112
frame #26:`lldb_private::ObjectFile::GetSymtab(this=0x...)::$_0::operator()() const at ObjectFile.cpp:764
frame #27:`decltype(__f=0x...)::$_0>(fp)()) std::__invoke<lldb_private::ObjectFile::GetSymtab()::$_0>(lldb_private::ObjectFile::GetSymtab()::$_0&&) at type_traits:3918
frame #28:`void std::__call_once_param<std::tuple<lldb_private::ObjectFile::GetSymtab()::$_0&&> >::__execute<>(this=0x..., (null)=__tuple_indices<> @ 0x...) at mutex:630
frame #29:`std::__call_once_param<std::tuple<lldb_private::ObjectFile::GetSymtab()::$_0&&> >::operator(this=0x...)() at mutex:622
frame #30:`void std::__call_once_proxy<std::tuple<lldb_private::ObjectFile::GetSymtab()::$_0&&> >(__vp=0x...) at mutex:658
frame #31:++.1.dylib`std::__call_once(unsigned long volatile&, void*, void (*)(void*)) + 180
frame #32:`void std::call_once<lldb_private::ObjectFile::GetSymtab()::$_0>(__flag=0x..., __func=0x...)::$_0&&) at mutex:676
frame #33:`void llvm::call_once<lldb_private::ObjectFile::GetSymtab()::$_0>(flag=0x..., F=0x...)::$_0&&) at Threading.h:89
frame #34:`lldb_private::ObjectFile::GetSymtab(this=0x...) at ObjectFile.cpp:758
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2022, 8:58 AM
Herald added subscribers: pmatos, asb. · View Herald Transcript

@clayborg friendly ping

Do you have an repro case that will show this issue? I would need to debug it. The main issue is ObjectFileJIT::ParseSymtab() is doing global lookups when it is creating it's own symbol table, which causes the deadlock

Do you have an repro case that will show this issue? I would need to debug it.

@clayborg sorry for the late reply. The repro happened to me while debugging some Swift:

func main() {
    let digits = /\d+/
    let match = "abc123".firstMatch(of: digits)
    // stop here
}

at the stop command, I ran p match?.anyRegexOutput which causes the deadlock.