This is an archive of the discontinued LLVM Phabricator instance.

[lldb] save ConstString hash value to cache files too
AcceptedPublic

Authored by llunak on Apr 29 2022, 3:42 PM.

Details

Reviewers
clayborg
Summary

Creating ConstString requires to compute the hash value of the string. When starting LLDB with index cache enabled and everything cached, profiling shows the computing of the hash to be ~10% of the startup time here. Saving the hash value in the cache too and reusing it can save that time. The size of cache files appear to increase by ~3%.

This requires D122974.

Diff Detail

Event Timeline

llunak created this revision.Apr 29 2022, 3:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 3:42 PM
llunak requested review of this revision.Apr 29 2022, 3:42 PM
llunak updated this revision to Diff 426212.Apr 30 2022, 1:12 AM
llunak edited the summary of this revision. (Show Details)

Added missing hash write to EncodeCStrMap().

llunak added a comment.May 1 2022, 3:37 AM

This makes TestDecodeCStringMaps fail. Is there an easy way for me to re-generate the data for the test?

clayborg requested changes to this revision.May 2 2022, 3:56 PM

A few things we might think of for this patch to improve it a bit: we use a StringTableReader and ConstStringTable to read and write a string table to disk, we could save the hashes before each string in the string table data itself. Then we don't need to change the format of any other sections (the symbol table or the manual DWARF index as we would always write out string table offsets just like before, but the offsets would be different since we would always write the hash + string into the string table data. It is ok for the string table data to contain the hashes for all strings since we store only ConstString values and we always need the hashes. This might save on some space since we would only have 1 hash per string instead of possible many hashes for the same string offset in the string table.

Then we have the option to be able to support loading older string tables in StringTableReader.

Right now when we encode the "ConstStringTable" we write out:

static const llvm::StringRef kStringTableIdentifier("STAB");

We would start writing out a newer identifier like:

static const llvm::StringRef kStringTableIdentifier("STB2");

If we wanted to support the older and newer string table formats, when we call StringTableReader::Decode, we can read the identifier and "do the right thing". On the "STAB" case we would read only the strings and then we could calculate the hashes on the fly if/when needed. In the "STB2" case, the string table would have the hash followed by the string, so if we had the string table offset for any string, we could subtract 4 from this and grab the already computed hash.

Or we can rely on updating the versions of each section. I would vote to allow us to load older and newer string tables if possible. If an older LLDB tries to load a cache file and the "STAB" string table doesn't match because it is the newer "STB2", then decoding will fail on older LLDBs.

Many times we have clients with older and newer LLDB binaries installed on the same system. I am trying to look for ways to make sure they don't keep overwriting each other's cache files. Another option would be to start using different cache file names instead of changing the version number by modifying "ManualDWARFIndex::GetCacheKey()" and Symtab::GetCacheKey() to change the cache file name a bit (use "-symtab2-" instead of "-symtab-"). I am open to suggestions.

This makes TestDecodeCStringMaps fail. Is there an easy way for me to re-generate the data for the test?

You can regenerate the data by:

  • run yaml2obj on the YAML from the TestDecodeCStringMaps function
  • start LLDB and enable the cache
  • load the a.out binary from step 1 that was created by the YAML and let it save a new cache file
  • grab the cache file from the cache directory for this file and run "hexdump" on it
  • edit out the mod time from the signature manually, otherwise when the object file is created from yaml, mod time won't match and it will fail to load
lldb/source/Core/Mangled.cpp
435–437

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

438–439

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

443–448

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

494–495

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

497–498

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

500–502

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

lldb/source/Plugins/SymbolFile/DWARF/ManualDWARFIndex.cpp
509

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

523–527

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

585–586

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

lldb/source/Plugins/SymbolFile/DWARF/NameToDIE.cpp
105–106

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

112–113

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

133–134

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

lldb/source/Symbol/Symtab.cpp
1198–1199

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

1214

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

1234

If we let the string table handle also store the string hash data, then this code doesn't need to change at all.

1241

remove

1263–1264

Remove

1316–1320

remove

This revision now requires changes to proceed.May 2 2022, 3:56 PM
llunak added a comment.May 4 2022, 1:24 AM

A few things we might think of for this patch to improve it a bit: we use a StringTableReader and ConstStringTable to read and write a string table to disk, we could save the hashes before each string in the string table data itself. Then we don't need to change the format of any other sections (the symbol table or the manual DWARF index as we would always write out string table offsets just like before, but the offsets would be different since we would always write the hash + string into the string table data. It is ok for the string table data to contain the hashes for all strings since we store only ConstString values and we always need the hashes. This might save on some space since we would only have 1 hash per string instead of possible many hashes for the same string offset in the string table.

Makes sense.

Many times we have clients with older and newer LLDB binaries installed on the same system. I am trying to look for ways to make sure they don't keep overwriting each other's cache files.

This does not make sense to me. I can possibly imagine it with different users on the same system, but same user? If I have a newer LLDB installed, why would I also use an older one?

Another option would be to start using different cache file names instead of changing the version number by modifying "ManualDWARFIndex::GetCacheKey()" and Symtab::GetCacheKey() to change the cache file name a bit (use "-symtab2-" instead of "-symtab-").

If really needed, then I think this is better. Caches are relatively cheap and throw-away, so IMO it doesn't make much sense to have an ability to read older versions.

llunak updated this revision to Diff 427107.May 4 2022, 12:37 PM
llunak edited the summary of this revision. (Show Details)

Hashes are saved directly in string table.
Changed filenames of cache files to be different from version 1.
No support for reading old format.

llunak added inline comments.May 4 2022, 12:41 PM
lldb/source/Plugins/SymbolFile/DWARF/ManualDWARFIndex.cpp
509

I have changed this, because it's now also a part of the filename.

clayborg requested changes to this revision.May 5 2022, 12:22 PM

Great changes, just a few fixes and this looks great!

lldb/include/lldb/Core/DataFileCache.h
216–219

I would just change these two variables "m_data" and "m_needByteSwap" to be a DataExtractor

lldb/source/Core/DataFileCache.cpp
272

I would still vote to not use this string or add the hash check.

279

Remove hash check as it isn't needed. We will bump the version if needed or change the filename again.

305–309

Remove, we can version bump or rename the cache file again if needed.

306–309

Fix comment since the empty string is now at offset 4, not zero since we emit the hash first.

309–312

If we use a "DataExtractor m_data;" as our ivar as suggested, this just becomes:

317–320
321
lldb/source/Utility/ConstString.cpp
343

Do we want to add the expensive check here?

347

Does our string pool not cache the hash? I don't believe it does, but if it does and we could access that here, it would be great.

This revision now requires changes to proceed.May 5 2022, 12:22 PM
llunak marked 28 inline comments as done.May 22 2022, 12:37 AM
llunak added inline comments.
lldb/source/Utility/ConstString.cpp
347

StringMap does include the full hash value, but I think one would need to have the hash value first in order to find it in the map.

llunak updated this revision to Diff 431217.May 22 2022, 12:40 AM
llunak marked an inline comment as done.
  • use DataExtractor in StringTableReader
  • rely on StringMap assert to check that hash algorithm has not changed
clayborg accepted this revision.May 23 2022, 1:10 PM

One small nit in the code you can choose to fix or not. LGTM. Thanks for the improvements!

lldb/source/Core/DataFileCache.cpp
266–267

NIT: we can combine these two lines

This revision is now accepted and ready to land.May 23 2022, 1:10 PM
llunak updated this revision to Diff 431605.May 24 2022, 12:44 AM
llunak marked an inline comment as done.

Added a unittest that explicitly checks the implementation of the hash value has not changed (+ comments on what to do if that ever happens).

clayborg accepted this revision.May 24 2022, 10:56 AM