This is an archive of the discontinued LLVM Phabricator instance.

Reduce time spent parsing support files
Needs ReviewPublic

Authored by Eric on Feb 11 2021, 9:10 AM.

Details

Summary

This change reduces time spent parsing support files by cacheing previously
parsed FileSpecs. This is useful for large C++ binaries where an include
file may appear in many compile units. Profiling shows that constructing the
full paths and validation in the FileSpec constructor is expensive.

Diff Detail

Event Timeline

Eric requested review of this revision.Feb 11 2021, 9:10 AM
Eric created this revision.
aadsm added a comment.Feb 12 2021, 3:28 PM

can you share the profiles you got? I'm curious on the speed up as I also with large c++ binaries.

bmeurer accepted this revision.Feb 15 2021, 5:22 AM

LGTM

This revision is now accepted and ready to land.Feb 15 2021, 5:22 AM

Looks good to me as well, but please also get some feedback from Pavel.

labath requested changes to this revision.Feb 15 2021, 8:14 AM

Unless I'm missing something, the caching introduced here is not sound. The directory table index is only meaningful when taken in the context of a specific dwarf unit, but this patch uses it globally.

One could fix this by adding the dwarf unit to the caching key (in one way or another), but then the question becomes if/why does that even help? I believe we should not be parsing the *same* file from the *same* unit twice -- if we do, then that's the problem to fix.

I'm also wondering if this wouldn't be better fixed by making the overall remapping interface or implementation smarter. It seems like that, in the common case of zero or one remapping specs, it should be possible to make an implementation that rejects non-matching entries very quickly, without any caching (and also processes matching ones in a reasonable amount of time).

Overall, I think we need to understand this problem better before proceeding. Seeing a profile of where the time is going in your use case (along with a description of that use case) would help a lot.

This revision now requires changes to proceed.Feb 15 2021, 8:14 AM

(I could potentially see how a full_original_path (without units or directory indexes) -> full_remapped_path cache could be useful, on the account of most of the files being headers, and a lot of those headers being repeated in many compile units.)

Eric added subscribers: aadsm, bmeurer, dblaikie and 3 others.EditedFeb 23 2021, 9:27 AM

Just got back from break. Is there a standard benchmarking suite I should
use for lldb? I have been benchmarking using our own tests on the Chrome
DevTools C++ extension, which uses lldb to read dwarf symbols for wasm
binaries.

I've been benchmarking with a test that enumerates all support files in a
moderately sized (121 MB) wasm binary. Here is the profile of the function
that loops over the compile units and gets their support files without the
cacheing change:

- 83.13% symbols_backend::WasmModule::GetSourceScripts[abi:cxx11]
   - 63.34% lldb_private::CompileUnit::GetSupportFiles
       - 63.28% SymbolFileDWARF::ParseSupportFiles
          - 57.99% ParseSupportFilesFromPrologue
             - 34.05% lldb_private::FileSpecList::EmplaceBack<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, llvm::sys::path::Style&>
               - 34.01% std::vector<lldb_private::FileSpec, std::allocator<lldb_private::FileSpec> >::emplace_back<std::__cxx11::basic_string<char, std::char_traits<char>, s
                  - 31.59% std::allocator_traits<std::allocator<lldb_private::FileSpec> >::construct<lldb_private::FileSpec, std::__cxx11::basic_string<char, std::char_trait
                     - 31.56% __gnu_cxx::new_allocator<lldb_private::FileSpec>::construct<lldb_private::FileSpec, std::__cxx11::basic_string<char, std::char_traits<char>, st
                        - 31.48% lldb_private::FileSpec::FileSpec
                           - 31.34% lldb_private::FileSpec::SetFile
                              + 19.18% (anonymous namespace)::needsNormalization
                              + 6.32% lldb_private::ConstString::SetString
                              + 1.67% llvm::sys::path::parent_path
                              + 1.62% llvm::sys::path::filename
                                0.88% llvm::sys::path::remove_dots
                              + 0.88% llvm::SmallString<128u>::SmallString
                  + 2.35% std::vector<lldb_private::FileSpec, std::allocator<lldb_private::FileSpec> >::_M_realloc_insert<std::__cxx11::basic_string<char, std::char_traits<c
            - 22.75% GetFileByIndex
               - 22.50% llvm::DWARFDebugLine::Prologue::getFileNameByIndex
                  + 11.98% isPathAbsoluteOnWindowsOrPosix
                  + 7.61% llvm::sys::path::append
                    0.52% llvm::Twine::Twine
              0.52% lldb_private::Module::RemapSourceFile
         + 4.72% ParseLLVMLineTablePrologue
   - 13.20% lldb_private::Module::GetCompileUnitAtIndex
      - 13.15% lldb_private::SymbolFile::GetCompileUnitAtIndex
         - 13.10% SymbolFileDWARF::ParseCompileUnitAtIndex
            - 12.69% SymbolFileDWARF::ParseCompileUnit
               + 11.22% DWARFCompileUnit::GetNonSkeletonUnit
   + 5.43% llvm::SmallSet<std::pair<llvm::StringRef, llvm::StringRef>, 1u, std::less<std::pair<llvm::StringRef, llvm::StringRef> > >::insert

Total time :

real 0m3.496s
user 0m2.616s
sys 0m0.902s

With the change I've uploaded this drops to:

real 0m2.255s
user 0m1.335s
sys 0m0.938s

I'll work on getting numbers for a correct version of this, though.
Incidentally I had an earlier version of this that cached by the full
filename, saving most of the time spent in the FileSpec constructor but not
the time spent in GetFileByIndex, which I could go back to as that should
be correct and also provided some improvement. But first I'm going to try a
version that caches by {directory name string, file name string} to see if
I can get a similar performance benefit from that.

Eric updated this revision to Diff 326114.Feb 24 2021, 9:19 AM

Fixed cacheing behavior to key by include path rather than include directory index.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2021, 9:19 AM
Eric added a comment.Feb 24 2021, 9:33 AM

I've updated the change to make the cacheing correct. I'm seeing the same query time improvement as with the faulty version. Updated profile:

- 62.58% symbols_backend::WasmModule::GetSourceScripts[abi:cxx11]
   - 25.68% lldb_private::Module::GetCompileUnitAtIndex
      - 25.55% lldb_private::SymbolFile::GetCompileUnitAtIndex
         - 25.50% SymbolFileDWARF::ParseCompileUnitAtIndex
            - 24.72% SymbolFileDWARF::ParseCompileUnit
               - 21.89% DWARFCompileUnit::GetNonSkeletonUnit
                    DWARFUnit::GetNonSkeletonUnit
                  + DWARFUnit::ExtractUnitDIEIfNeeded
               + 0.99% DWARFUnit::GetPathStyle
               + 0.72% lldb_private::FileSpec::FileSpec
              0.57% lldb_private::SymbolFile::AssertModuleLock
   - 22.88% lldb_private::CompileUnit::GetSupportFiles
      - 22.82% SymbolFileDWARF::ParseSupportFiles
         - 12.70% ParseSupportFilesFromPrologue
            + 5.29% std::map<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<cha
            + 1.94% lldb_private::FileSpec::FileSpec
            + 1.38% GetFileByIndex
            + 1.02% lldb_private::FileSpecList::EmplaceBack<lldb_private::FileSpec&>
              0.78% std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::
              0.64% llvm::DWARFDebugLine::Prologue::getIncludeDirectoryFromEntry
         - 9.34% ParseLLVMLineTablePrologue
            - 9.33% llvm::DWARFDebugLine::Prologue::parse
               - 8.78% parseV2DirFileTables
                  + 4.21% llvm::DataExtractor::getULEB128
                  + 1.70% llvm::DataExtractor::getCStrRef
                  + 0.84% std::vector<llvm::DWARFDebugLine::FileNameEntry, std::allocator<llvm::DWARFDebugLine::FileNameEntry> >::push_back
   + 11.73% llvm::SmallSet<std::pair<llvm::StringRef, llvm::StringRef>, 1u, std::less<std::pair<llvm::StringRef, llvm::StringRef> > >::insert
   + 0.79% lldb_private::Module::GetNumCompileUnits
   + 0.76% lldb_private::ConstString::GetStringRef

As you can see ParseSupportFiles is substantially less expensive and the cacheing overhead is not high. The main remaining issue is loading all the non-skeleton units, which I hope to address in a future change but is proving to be trickier.

Eric updated this revision to Diff 326125.Feb 24 2021, 9:54 AM

Ran clang format

Thanks for the explanation and the additional data, and sorry for the delay on my end.

Regarding the "standard" benchmarks, the only thing that comes close is this. I'm not sure if it contains any test case which would demonstrate this (many compile units, many common support/include files), but maybe @teemperor could add one?

I'm afraid I think the new implementation is still not completely sound. The (include) directories in line table are not guaranteed to be absolute -- they can be relative (in DWARF <= 4) to the compilation directory (DW_AT_comp_dir) of the main unit. This is way Prologue::getFileNameByIndex takes an additional compile_dir argument. However, this directory is ignored in the computation of the cache key. So, although unlikely, one could run into the situation where a line table has the same file and directory entries, but they still refer to different files, because they are relative to different compilation directories.

Also, I can't escape the feeling that this should be achievable without an additional caching layer. In your benchmark, the hottest piece of code appears to be the needsNormalization function, which is a essentially a linear scan of the file name. However, one also needs a linear scan (maybe more than one) to implement std::map. What is it that makes needsNormalization so much slower? Is it the lack of vectorization due to the complex control flow? Could the function be rewritten to make it more optimizable?

Thanks for the explanation and the additional data, and sorry for the delay on my end.

Regarding the "standard" benchmarks, the only thing that comes close is this. I'm not sure if it contains any test case which would demonstrate this (many compile units, many common support/include files), but maybe @teemperor could add one?

I should point out that the benchmarks are offline since a while (due to server migration and some plans to rework them).