This change reduces time spent parsing support files by cacheing previously
parsed FileSpecs. This is useful for large C++ binaries where an include
file may appear in many compile units. Profiling shows that constructing the
full paths and validation in the FileSpec constructor is expensive.
Details
Diff Detail
Unit Tests
Event Timeline
can you share the profiles you got? I'm curious on the speed up as I also with large c++ binaries.
Unless I'm missing something, the caching introduced here is not sound. The directory table index is only meaningful when taken in the context of a specific dwarf unit, but this patch uses it globally.
One could fix this by adding the dwarf unit to the caching key (in one way or another), but then the question becomes if/why does that even help? I believe we should not be parsing the *same* file from the *same* unit twice -- if we do, then that's the problem to fix.
I'm also wondering if this wouldn't be better fixed by making the overall remapping interface or implementation smarter. It seems like that, in the common case of zero or one remapping specs, it should be possible to make an implementation that rejects non-matching entries very quickly, without any caching (and also processes matching ones in a reasonable amount of time).
Overall, I think we need to understand this problem better before proceeding. Seeing a profile of where the time is going in your use case (along with a description of that use case) would help a lot.
(I could potentially see how a full_original_path (without units or directory indexes) -> full_remapped_path cache could be useful, on the account of most of the files being headers, and a lot of those headers being repeated in many compile units.)
Just got back from break. Is there a standard benchmarking suite I should
use for lldb? I have been benchmarking using our own tests on the Chrome
DevTools C++ extension, which uses lldb to read dwarf symbols for wasm
binaries.
I've been benchmarking with a test that enumerates all support files in a
moderately sized (121 MB) wasm binary. Here is the profile of the function
that loops over the compile units and gets their support files without the
cacheing change:
- 83.13% symbols_backend::WasmModule::GetSourceScripts[abi:cxx11] - 63.34% lldb_private::CompileUnit::GetSupportFiles - 63.28% SymbolFileDWARF::ParseSupportFiles - 57.99% ParseSupportFilesFromPrologue - 34.05% lldb_private::FileSpecList::EmplaceBack<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, llvm::sys::path::Style&> - 34.01% std::vector<lldb_private::FileSpec, std::allocator<lldb_private::FileSpec> >::emplace_back<std::__cxx11::basic_string<char, std::char_traits<char>, s - 31.59% std::allocator_traits<std::allocator<lldb_private::FileSpec> >::construct<lldb_private::FileSpec, std::__cxx11::basic_string<char, std::char_trait - 31.56% __gnu_cxx::new_allocator<lldb_private::FileSpec>::construct<lldb_private::FileSpec, std::__cxx11::basic_string<char, std::char_traits<char>, st - 31.48% lldb_private::FileSpec::FileSpec - 31.34% lldb_private::FileSpec::SetFile + 19.18% (anonymous namespace)::needsNormalization + 6.32% lldb_private::ConstString::SetString + 1.67% llvm::sys::path::parent_path + 1.62% llvm::sys::path::filename 0.88% llvm::sys::path::remove_dots + 0.88% llvm::SmallString<128u>::SmallString + 2.35% std::vector<lldb_private::FileSpec, std::allocator<lldb_private::FileSpec> >::_M_realloc_insert<std::__cxx11::basic_string<char, std::char_traits<c - 22.75% GetFileByIndex - 22.50% llvm::DWARFDebugLine::Prologue::getFileNameByIndex + 11.98% isPathAbsoluteOnWindowsOrPosix + 7.61% llvm::sys::path::append 0.52% llvm::Twine::Twine 0.52% lldb_private::Module::RemapSourceFile + 4.72% ParseLLVMLineTablePrologue - 13.20% lldb_private::Module::GetCompileUnitAtIndex - 13.15% lldb_private::SymbolFile::GetCompileUnitAtIndex - 13.10% SymbolFileDWARF::ParseCompileUnitAtIndex - 12.69% SymbolFileDWARF::ParseCompileUnit + 11.22% DWARFCompileUnit::GetNonSkeletonUnit + 5.43% llvm::SmallSet<std::pair<llvm::StringRef, llvm::StringRef>, 1u, std::less<std::pair<llvm::StringRef, llvm::StringRef> > >::insert
Total time :
real 0m3.496s
user 0m2.616s
sys 0m0.902s
With the change I've uploaded this drops to:
real 0m2.255s
user 0m1.335s
sys 0m0.938s
I'll work on getting numbers for a correct version of this, though.
Incidentally I had an earlier version of this that cached by the full
filename, saving most of the time spent in the FileSpec constructor but not
the time spent in GetFileByIndex, which I could go back to as that should
be correct and also provided some improvement. But first I'm going to try a
version that caches by {directory name string, file name string} to see if
I can get a similar performance benefit from that.
I've updated the change to make the cacheing correct. I'm seeing the same query time improvement as with the faulty version. Updated profile:
- 62.58% symbols_backend::WasmModule::GetSourceScripts[abi:cxx11] - 25.68% lldb_private::Module::GetCompileUnitAtIndex - 25.55% lldb_private::SymbolFile::GetCompileUnitAtIndex - 25.50% SymbolFileDWARF::ParseCompileUnitAtIndex - 24.72% SymbolFileDWARF::ParseCompileUnit - 21.89% DWARFCompileUnit::GetNonSkeletonUnit DWARFUnit::GetNonSkeletonUnit + DWARFUnit::ExtractUnitDIEIfNeeded + 0.99% DWARFUnit::GetPathStyle + 0.72% lldb_private::FileSpec::FileSpec 0.57% lldb_private::SymbolFile::AssertModuleLock - 22.88% lldb_private::CompileUnit::GetSupportFiles - 22.82% SymbolFileDWARF::ParseSupportFiles - 12.70% ParseSupportFilesFromPrologue + 5.29% std::map<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<cha + 1.94% lldb_private::FileSpec::FileSpec + 1.38% GetFileByIndex + 1.02% lldb_private::FileSpecList::EmplaceBack<lldb_private::FileSpec&> 0.78% std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std:: 0.64% llvm::DWARFDebugLine::Prologue::getIncludeDirectoryFromEntry - 9.34% ParseLLVMLineTablePrologue - 9.33% llvm::DWARFDebugLine::Prologue::parse - 8.78% parseV2DirFileTables + 4.21% llvm::DataExtractor::getULEB128 + 1.70% llvm::DataExtractor::getCStrRef + 0.84% std::vector<llvm::DWARFDebugLine::FileNameEntry, std::allocator<llvm::DWARFDebugLine::FileNameEntry> >::push_back + 11.73% llvm::SmallSet<std::pair<llvm::StringRef, llvm::StringRef>, 1u, std::less<std::pair<llvm::StringRef, llvm::StringRef> > >::insert + 0.79% lldb_private::Module::GetNumCompileUnits + 0.76% lldb_private::ConstString::GetStringRef
As you can see ParseSupportFiles is substantially less expensive and the cacheing overhead is not high. The main remaining issue is loading all the non-skeleton units, which I hope to address in a future change but is proving to be trickier.
Thanks for the explanation and the additional data, and sorry for the delay on my end.
Regarding the "standard" benchmarks, the only thing that comes close is this. I'm not sure if it contains any test case which would demonstrate this (many compile units, many common support/include files), but maybe @teemperor could add one?
I'm afraid I think the new implementation is still not completely sound. The (include) directories in line table are not guaranteed to be absolute -- they can be relative (in DWARF <= 4) to the compilation directory (DW_AT_comp_dir) of the main unit. This is way Prologue::getFileNameByIndex takes an additional compile_dir argument. However, this directory is ignored in the computation of the cache key. So, although unlikely, one could run into the situation where a line table has the same file and directory entries, but they still refer to different files, because they are relative to different compilation directories.
Also, I can't escape the feeling that this should be achievable without an additional caching layer. In your benchmark, the hottest piece of code appears to be the needsNormalization function, which is a essentially a linear scan of the file name. However, one also needs a linear scan (maybe more than one) to implement std::map. What is it that makes needsNormalization so much slower? Is it the lack of vectorization due to the complex control flow? Could the function be rewritten to make it more optimizable?
I should point out that the benchmarks are offline since a while (due to server migration and some plans to rework them).
clang-tidy: error: 'lldb/Core/UniqueCStringMap.h' file not found [clang-diagnostic-error]
not useful
clang-tidy: error: 'lldb/Core/UniqueCStringMap.h' file not found [clang-diagnostic-error]
not useful