This is an archive of the discontinued LLVM Phabricator instance.

prevent ConstString from calling djbHash() more than once
ClosedPublic

Authored by dblaikie on Apr 2 2022, 5:46 AM.

Details

Summary

Pool::GetConstCStringWithStringRef() may call djbHash() up to 3 times for the same string: First from Pool::hash() when deciding which StringMap in the pool to use, then from read-locked StringMap::find() and then possibly from StringMap::insert() if insertion is needed.

If LLDB's index cache is enabled and everything is cached, calls to djbHash() are ~30% of the total CPU time here when I profile lldb start. Modifying StringMap to allow explicitly passing in a precomputed hash and calling djbHash() just once reduces startup CPU time by ~19%. Explicitly passing in the hash will also allow LLDB to cache the value in a cache file, thus avoiding computing it completely (another commit).

Diff Detail

Event Timeline

llunak created this revision.Apr 2 2022, 5:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2022, 5:46 AM
llunak requested review of this revision.Apr 2 2022, 5:46 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 2 2022, 5:46 AM

The ConstString/StringPool is pretty hot so I'm very excited about making it faster.

I'm slightly concerned about the two hash values (the "full" hash vs the single byte one). That's not something that was introduced in this patch, but passing it around adds an opportunity to get it wrong.

I'm wondering if we could wrap this in a struct and pass that around:

struct HashedStringRef {
  const llvm::StringRef str;
  const unsigned full_hash;
  const uint8_t hash; 
  HashedStringRef(llvm::StringRef s) : str(s), full_hash(djbHash(str)), hash(hash(str)) {}
}

It looks like we always need both except in the StringMap implementation, but if I'm reading the code correctly we'll have constructed it in ConstString already anyway?

I'm slightly concerned about the two hash values (the "full" hash vs the single byte one). That's not something that was introduced in this patch, but passing it around adds an opportunity to get it wrong.

If this is aimed at me, then I don't know what you mean here. The single-byte hash is not passed around, it's used only locally in the two places that my change modifies and one more place where StringMap is not accessed (so no full hash needed there). I see no easy way for it to go wrong, and IMO the HashedStringRef struct would just complicate the code.

LGTM from a LLDB perspective.

labath added subscribers: dblaikie, labath.

In principle, I like this a lot, though I'm not sure if this is the best way to integrate with with the StringMap class. Adding @dblaikie for thoughts on that.

Some background:
LLDB uses a string pool, and during (parallel) dwarf indexing, this string pool is the biggest bottleneck. To reduce contention, lldb actually uses a set of string pools, where it first computes a hash of the string (I believe this is the one-byte hash that Jonas is referring to) to determine which pool (and which mutex) to use, and then performs the actual container operation. This approach has resulted in considerable speedup, but it means the string is getting hashed more than once. Luboš is trying to remove the redundant hashes and speed it up even further.

(sorry, this slipping through somehow - it's on my radar now)

The ConstString/StringPool is pretty hot so I'm very excited about making it faster.

I'm slightly concerned about the two hash values (the "full" hash vs the single byte one). That's not something that was introduced in this patch, but passing it around adds an opportunity to get it wrong.

I'm wondering if we could wrap this in a struct and pass that around:

struct HashedStringRef {
  const llvm::StringRef str;
  const unsigned full_hash;
  const uint8_t hash; 
  HashedStringRef(llvm::StringRef s) : str(s), full_hash(djbHash(str)), hash(hash(str)) {}
}

It looks like we always need both except in the StringMap implementation, but if I'm reading the code correctly we'll have constructed it in ConstString already anyway?

I'm sort of with you here - I don't think this API feature needs to affect lldb much - the scope of these externally-computed hashes is small, but the StringMap API could be more robust/less error prone by having a struct like this. Specifically StringMap::hash could/should return an immutable struct that contains the StringRef and the hash (not that that's bulletproof - the data beneath the StringRef could still be modified externally - but the same is true of any hashing data structure - the keys might be shallow or have mutable state, etc) - and that immutable object must be passed back to the hash-providing insertion functions. (these functions could/should probably still then assert that the hash is correct in case of that underlying mutation - though someone could argue that's overkill and I'd be open to having that discussion).

By immutable I mean that the caller can't go and modify either the StringRef or hash to make these out of sync. These handles are probably still copy assignable. The external code shouldn't be able to create their own (ctor private/protected, etc) - the only way to get one is to call StringMap to produce one.

lldb/source/Utility/ConstString.cpp
106–109

total aside, but it might be nice to refactor m_string_pools[h] out into a named variable here - accessing it 3 times and in a context where it's super important that it's the same thing in all 3 places, seems like it'd be easier to read with a named variable - might make it clear which things need to happen under the lock and which ones don't, etc.

(oh, and the 4th and 5th use a few lines down - and I think that'd be the only uses of h, so h could go away and hash could be called inline in the array index instead, maybe?)

...

struct HashedStringRef {
  const llvm::StringRef str;
  const unsigned full_hash;
  const uint8_t hash; 
  HashedStringRef(llvm::StringRef s) : str(s), full_hash(djbHash(str)), hash(hash(str)) {}
}

...

The external code shouldn't be able to create their own (ctor private/protected, etc) - the only way to get one is to call StringMap to produce one.

But what if I don't want StringMap to compute the hash at all? There's still that 10-15% of CPU time spent in djbHash() when debug info uses exactly[*] that (and LLDB's index cache could be modified to store it). Which means it can be useful for StringMap to allow accepting precomputed hash, and then what purpose will that HashedStringRef have?

(*) Well, not exactly, the seed is different. Doesn't seem an impossible problem to solve though.

(these functions could/should probably still then assert that the hash is correct in case of that underlying mutation - though someone could argue that's overkill and I'd be open to having that discussion).

I'd prefer to have that discussion. If your argument is perfection, then let's do the full monty.

...

struct HashedStringRef {
  const llvm::StringRef str;
  const unsigned full_hash;
  const uint8_t hash; 
  HashedStringRef(llvm::StringRef s) : str(s), full_hash(djbHash(str)), hash(hash(str)) {}
}

...

The external code shouldn't be able to create their own (ctor private/protected, etc) - the only way to get one is to call StringMap to produce one.

But what if I don't want StringMap to compute the hash at all? There's still that 10-15% of CPU time spent in djbHash() when debug info uses exactly[*] that (and LLDB's index cache could be modified to store it). Which means it can be useful for StringMap to allow accepting precomputed hash, and then what purpose will that HashedStringRef have?

I think that David's point was that you would use a (probably static) StringMap method to produce the hash object, and then you use the hash object to select the correct map, and also to insert the key in the map. Something like:

auto hash_obj = MyStringMap::Hash(key);
MyStringMap &map = maps[hash_obj.one_char_hash];
map.insert(has_obj, key, value);

That should only produce one hash computation, right?

(*) Well, not exactly, the seed is different. Doesn't seem an impossible problem to solve though.

I think that's because it's trying to produce a value that is not correlated to the value used inside the individual objects (so that you don't overload some buckets of the maps while keeping others empty). However, there are other ways to achieve that. Since StringMap is always using a power of two for the number of buckets, I think it would be sufficient to use an odd number of StringMap instances (say 255). Using the modulo operation to select the StringMap should select spread out the complete hashes evenly among to individual maps.

And it would mean that the StringMap does not have to concern itself with the one-char hashes (which definitely seems like a layering violation).

(these functions could/should probably still then assert that the hash is correct in case of that underlying mutation - though someone could argue that's overkill and I'd be open to having that discussion).

I'd prefer to have that discussion. If your argument is perfection, then let's do the full monty.

I guess we could put those checks under #ifdef EXPENSIVE_CHECKS (?)

But what if I don't want StringMap to compute the hash at all? There's still that 10-15% of CPU time spent in djbHash() when debug info uses exactly[*] that (and LLDB's index cache could be modified to store it). Which means it can be useful for StringMap to allow accepting precomputed hash, and then what purpose will that HashedStringRef have?

I think that David's point was that you would use a (probably static) StringMap method to produce the hash object, and then you use the hash object to select the correct map, and also to insert the key in the map. Something like:

...

That should only produce one hash computation, right?

Yes, it would, but that's not my point. If StringMap insists on computing the hash itself and doesn't allow outside code to provide it, then that prevents the possible LLDB optimization saving the remaing 10-15% startup CPU cost by not computing the hash at all. If StringMap allows that, then this HashedStringRef idea can be easily circumvented by passing the hash directly, so it seems rather pointless. And the relevant LLDB code from this change is so small that creating HashedStringRef just for that seems like an overkill.

(*) Well, not exactly, the seed is different. Doesn't seem an impossible problem to solve though.

I think that's because it's trying to produce a value that is not correlated to the value used inside the individual objects (so that you don't overload some buckets of the maps while keeping others empty). However, there are other ways to achieve that. Since StringMap is always using a power of two for the number of buckets, I think it would be sufficient to use an odd number of StringMap instances (say 255). Using the modulo operation to select the StringMap should select spread out the complete hashes evenly among to individual maps.

By different seed I meant that DWARF and StringMap use different seed for DJB hash. And that's only because LLVM code historically used the worse 0 seed and some tests rely on that (7b319df29bf4ebe690ca0c41761e46d8b0081293). But it seems that LLDB doesn't even read .debug_names entries into memory, so this having anything to do with DWARF debuginfo is actually moot, it'd only matter for LLDB index cache, which could use any hash it wants.

But you're possibly talking about something unrelated here.

But what if I don't want StringMap to compute the hash at all? There's still that 10-15% of CPU time spent in djbHash() when debug info uses exactly[*] that (and LLDB's index cache could be modified to store it). Which means it can be useful for StringMap to allow accepting precomputed hash, and then what purpose will that HashedStringRef have?

I think that David's point was that you would use a (probably static) StringMap method to produce the hash object, and then you use the hash object to select the correct map, and also to insert the key in the map. Something like:

...

That should only produce one hash computation, right?

Yes, it would, but that's not my point. If StringMap insists on computing the hash itself and doesn't allow outside code to provide it, then that prevents the possible LLDB optimization saving the remaing 10-15% startup CPU cost by not computing the hash at all. If StringMap allows that, then this HashedStringRef idea can be easily circumvented by passing the hash directly, so it seems rather pointless. And the relevant LLDB code from this change is so small that creating HashedStringRef just for that seems like an overkill.

Interesting. I don't know if I missed this somewhere, but could explain what kind of a map operation can lldb perform without computing the hash at least once?

(*) Well, not exactly, the seed is different. Doesn't seem an impossible problem to solve though.

I think that's because it's trying to produce a value that is not correlated to the value used inside the individual objects (so that you don't overload some buckets of the maps while keeping others empty). However, there are other ways to achieve that. Since StringMap is always using a power of two for the number of buckets, I think it would be sufficient to use an odd number of StringMap instances (say 255). Using the modulo operation to select the StringMap should select spread out the complete hashes evenly among to individual maps.

By different seed I meant that DWARF and StringMap use different seed for DJB hash. And that's only because LLVM code historically used the worse 0 seed and some tests rely on that (7b319df29bf4ebe690ca0c41761e46d8b0081293). But it seems that LLDB doesn't even read .debug_names entries into memory, so this having anything to do with DWARF debuginfo is actually moot, it'd only matter for LLDB index cache, which could use any hash it wants.

But you're possibly talking about something unrelated here.

Yeah, I thought that the we used a different seed (or hash?) when computing the StringMap index -- I believe that was the case at some point, but I guess it was changed since then...

Interesting. I don't know if I missed this somewhere, but could explain what kind of a map operation can lldb perform without computing the hash at least once?

All of them :). Computing is not the only way of obtaining the hash of a string. LLDB index cache (settings set symbols.enable-lldb-index-cache true) stores a big bunch of strings and when loading them back into ConstString the next time djbHash() is ~15% of total CPU LLDB startup time. That could be saved if the cache cached the hash values too.

Interesting. I don't know if I missed this somewhere, but could explain what kind of a map operation can lldb perform without computing the hash at least once?

All of them :). Computing is not the only way of obtaining the hash of a string. LLDB index cache (settings set symbols.enable-lldb-index-cache true) stores a big bunch of strings and when loading them back into ConstString the next time djbHash() is ~15% of total CPU LLDB startup time. That could be saved if the cache cached the hash values too.

If the string pool caches the hash value, we could actually write out the hash in the cache file to speed up loading. I didn't want to re-hash every string when I saved the strings out to the cache file due to the cost, but it would be interesting to look into to see if it improves performance.

If the string pool caches the hash value, we could actually write out the hash in the cache file to speed up loading.

The patch doing that is D124704 (except for the string pool caching).

I didn't want to re-hash every string when I saved the strings out to the cache file due to the cost, but it would be interesting to look into to see if it improves performance.

I can measure 10% startup time saved when everything is already cached. Funnily enough, profiler consistently claims that saving caches got faster too (I already use D122975).

I can measure 10% startup time saved when everything is already cached. Funnily enough, profiler consistently claims that saving caches got faster too (I already use D122975).

Never mind, the saving was a bug. But the loading number is valid.

llunak updated this revision to Diff 427187.May 4 2022, 9:03 PM
llunak edited the summary of this revision. (Show Details)

Used a temporary variable instead of repeated 'm_string_pools[h]'.
Added assert that the passed-in hash value matches, guarded by EXPENSIVE_CHECKS. It will assert also hashes computed by StringMap itself, but checking only values passed from outside would mean adding a number of *Impl functions and adding asserts in a number of places.

llunak updated this revision to Diff 427188.May 4 2022, 9:10 PM
clayborg requested changes to this revision.May 5 2022, 11:28 AM
clayborg added inline comments.
lldb/source/Utility/ConstString.cpp
176–177

Can we rename these functions to something other than "hash"? It is a bit confusing to have a function called "hash" that returns a uint8_t when what this is really doing is getting use the string pool index. Maybe "GetPoolIndex"? I could see this function accidentally being used to hash a string instead of using StringPool::hash()

llvm/include/llvm/ADT/StringMap.h
69

Use "uint32_t" instead of unsigned to clarify the exact width of the integer. Or create a "HashType" typedef and then use that.

106
llvm/lib/Support/StringMap.cpp
83
141
This revision now requires changes to proceed.May 5 2022, 11:28 AM
llunak updated this revision to Diff 431216.May 22 2022, 12:34 AM
llunak marked 6 inline comments as done.
  • add a function for selecting a pool in ConstString
  • use uint32_t for hash in StringMap API
clayborg accepted this revision.May 23 2022, 1:13 PM
This revision is now accepted and ready to land.May 23 2022, 1:13 PM

...

struct HashedStringRef {
  const llvm::StringRef str;
  const unsigned full_hash;
  const uint8_t hash; 
  HashedStringRef(llvm::StringRef s) : str(s), full_hash(djbHash(str)), hash(hash(str)) {}
}

...

The external code shouldn't be able to create their own (ctor private/protected, etc) - the only way to get one is to call StringMap to produce one.

But what if I don't want StringMap to compute the hash at all? There's still that 10-15% of CPU time spent in djbHash() when debug info uses exactly[*] that (and LLDB's index cache could be modified to store it). Which means it can be useful for StringMap to allow accepting precomputed hash, and then what purpose will that HashedStringRef have?

If that happens, yeah, there's probably no point protecting this API - though I'd be more cautious of that change/any change that supports serializing the hash as this could end up causing StringMap to have to have stability guarantees that might be unfavorable for other uses (see ABSL maps that intentionally randomize/reseed each instance to ensure users aren't depending on undocumented guarantees - this gives them the flexibility to update the hash algorithm with something better in the future without breaking users who might be serializing the hashes/relying on iteration order/etc)

If we want a structure that can use a stable hash - it might be at that point that we move the hash support entirely out of StringMap and make it pluggable, with the default implementation doing djbHash as before, and even the new "stable" one doing that too, but doing it explicitly/documenting what guarantees it requires (stability within a major version? Across major versions?)

& then I guess what the API looks like is still an open question - perhaps the default trait (the one without any stability guarantees) could have a private implementation and expose that to StringMap via friendship. The stable implementation can have a public implementation for hashing & then an API like the one proposed where they can be passed into StringMap (yeah, without any of the handle safety previously proposed) - and assert when it's passed in that it matches what the trait provides (under EXPENSIVE_CHECKS, probably).

If we want a structure that can use a stable hash

Not for D124704. It doesn't make sense to require a stable hash algorithm for an internal cache file. All that it requires is that StringMap provides the hash value for a string, which it does with the public hash(). If that implementation changes, the EXPENSIVE_CHECKS assert will find that when unittesting the LLDB parts, and in that case the versioned cache file can have its version increased and problem solved. Even in case StringMap implementation changes in a way that no longer makes it feasible to store the precomputed hash value it's simpler to just dump the optimization.

So no need to overcomplicate this. As for the case of somebody else trying to rely on the stability of the hash, I can solve that by adding a comment that says it's not stable (and then that somebody else can go to the lengths of making it stable if needed).

llunak updated this revision to Diff 431597.May 23 2022, 11:19 PM

Added documentation for StringMap::hash(), including an explicit comment saying that the implementation is not guaranteed to be stable.

clayborg accepted this revision.May 24 2022, 12:04 PM

It doesn't make sense to require a stable hash algorithm for an internal cache file.

It's at least a stronger stability requirement than may be provided before - like: stable across process boundaries, at least, by the sounds of it? yeah?
& then still raises the question for me what about minor version updates, is it expected to be stable across those?

It'd be a bit subtle to have to know when to go and update the lldb cache version number because this hash function has changed, for instance. It might be more suitable to have lldb explicitly request a known hash function rather than the generic one (even if they're identical at the moment) so updates to LLVM's core libraries don't subtly break the hashing/cache system here. (I would guess there's no cross-version hash testing? So it seems like such a change could produce fairly subtle breakage that would slip under the radar fairly easily?)

It doesn't make sense to require a stable hash algorithm for an internal cache file.

It's at least a stronger stability requirement than may be provided before - like: stable across process boundaries, at least, by the sounds of it? yeah?

It's meant to be the same for the same library binary.

& then still raises the question for me what about minor version updates, is it expected to be stable across those?

Is anything expected to be stable across those? If so, that doesn't seem to be the reality of it (a random quick check finds two ABI changes just between 14.0.4 and 14.0.5, e6de9ed37308e46560243229dd78e84542f37ead and 53433dd0b5034681e1866e74651e8527a29e9364).

It'd be a bit subtle to have to know when to go and update the lldb cache version number because this hash function has changed, for instance. It might be more suitable to have lldb explicitly request a known hash function rather than the generic one (even if they're identical at the moment) so updates to LLVM's core libraries don't subtly break the hashing/cache system here. (I would guess there's no cross-version hash testing? So it seems like such a change could produce fairly subtle breakage that would slip under the radar fairly easily?)

D124704 adds a unittest that compares StringMap::hash() to a known hardcoded value, so whenever the hash implementation changes, it won't be possible to unittest LLDB with that change, and that will be the time to change the lldb cache version. If the theory is that this should keep working even with the library changing without LLDB rebuild, then as I wrote above that theory needs double-checking first. And additionally a good question to ask would be if it's really a good idea to do incompatible implementation changes to a class that has half of its functionality inline. Finally, there have been already attempts to change the hash function to use the better non-zero seed (D97396), and they were reverted because something depended on the hash function not changing, so I don't see it that likely for the hash function to change just like that in a minor update.

But if after all this that's still the theory, I guess StringMap could get an additional template parameter specifying the hash function, if it's actually worth the effort.

@clayborg : BTW, this is exactly the reason why I wanted the cache files header to include a hash value for a known string.

It doesn't make sense to require a stable hash algorithm for an internal cache file.

It's at least a stronger stability requirement than may be provided before - like: stable across process boundaries, at least, by the sounds of it? yeah?

It's meant to be the same for the same library binary.

& then still raises the question for me what about minor version updates, is it expected to be stable across those?

Is anything expected to be stable across those? If so, that doesn't seem to be the reality of it (a random quick check finds two ABI changes just between 14.0.4 and 14.0.5, e6de9ed37308e46560243229dd78e84542f37ead and 53433dd0b5034681e1866e74651e8527a29e9364).

My udnerstanding is that generally ABI stability is intended to be guaranteed across minor releases (& those patches don't look like ABI breaks to me - the first one changes the mangling of intended-to-be-local symbols so they don't collide, so it should cause valid programs to link when they previously failed to link. The second seems to add a new target that wasn't present - so nothing to break against?) but that's probably orthogonal to the stability of the map/cache/expectations of folks releasing and using lldb.

It'd be a bit subtle to have to know when to go and update the lldb cache version number because this hash function has changed, for instance. It might be more suitable to have lldb explicitly request a known hash function rather than the generic one (even if they're identical at the moment) so updates to LLVM's core libraries don't subtly break the hashing/cache system here. (I would guess there's no cross-version hash testing? So it seems like such a change could produce fairly subtle breakage that would slip under the radar fairly easily?)

D124704 adds a unittest that compares StringMap::hash() to a known hardcoded value, so whenever the hash implementation changes, it won't be possible to unittest LLDB with that change, and that will be the time to change the lldb cache version.

Ah, good stuff - doesn't guarantee that any hash change necessarily breaks the test, but certainly helps/seems like a good idea, thanks!

If the theory is that this should keep working even with the library changing without LLDB rebuild, then as I wrote above that theory needs double-checking first. And additionally a good question to ask would be if it's really a good idea to do incompatible implementation changes to a class that has half of its functionality inline.

Sorry, I wasn't meaning to discuss dynamic library versioning issues/mismatches, just static/fully matched versions.

Finally, there have been already attempts to change the hash function to use the better non-zero seed (D97396), and they were reverted because something depended on the hash function not changing, so I don't see it that likely for the hash function to change just like that in a minor update.

That something seems to have been another part of lldb - and that's the sort of change I'd like to enable/not make harder by avoiding more dependence on this ordering/hashing.

But if after all this that's still the theory, I guess StringMap could get an additional template parameter specifying the hash function, if it's actually worth the effort.

Yeah, a little traits class or the like is roughly what I'd picture.

D124704 adds a unittest that compares StringMap::hash() to a known hardcoded value, so whenever the hash implementation changes, it won't be possible to unittest LLDB with that change, and that will be the time to change the lldb cache version.

Ah, good stuff - doesn't guarantee that any hash change necessarily breaks the test, but certainly helps/seems like a good idea, thanks!

I think that does guarantee it in practice. If changing a string hash implementation does not change the result for a randomly chosen non-trivial input, then either the implementation is compatible or it's very poor, so I don't see why either of these should be a practical concern. But if the highly improbable case is a concern for some reason I fail to see, I can make the test check two different inputs, which should make the case virtually impossible.

If the theory is that this should keep working even with the library changing without LLDB rebuild, then as I wrote above that theory needs double-checking first. And additionally a good question to ask would be if it's really a good idea to do incompatible implementation changes to a class that has half of its functionality inline.

Sorry, I wasn't meaning to discuss dynamic library versioning issues/mismatches, just static/fully matched versions.

Then I still don't know what the problem is supposed to be. If the StringMap hash implementation ever changes, the necessary LLDB rebuild will detect this, the relevant LLDB parts will get adjusted and problem solved.

If the theory is that this should keep working even with the library changing without LLDB rebuild, then as I wrote above that theory needs double-checking first. And additionally a good question to ask would be if it's really a good idea to do incompatible implementation changes to a class that has half of its functionality inline.

Sorry, I wasn't meaning to discuss dynamic library versioning issues/mismatches, just static/fully matched versions.

Then I still don't know what the problem is supposed to be. If the StringMap hash implementation ever changes, the necessary LLDB rebuild will detect this, the relevant LLDB parts will get adjusted and problem solved.

What I mean is if the cache is used across statically linked versions - eg: cache is created, someone installs an update to lldb, then the cache is read back and misinterprets the hashes in the cache if the hash algorithm had changed between versions.

Finally, there have been already attempts to change the hash function to use the better non-zero seed (D97396), and they were reverted because something depended on the hash function not changing, so I don't see it that likely for the hash function to change just like that in a minor update.

That something seems to have been another part of lldb - and that's the sort of change I'd like to enable/not make harder by avoiding more dependence on this ordering/hashing.

But if after all this that's still the theory, I guess StringMap could get an additional template parameter specifying the hash function, if it's actually worth the effort.

Yeah, a little traits class or the like is roughly what I'd picture.

^

Then I still don't know what the problem is supposed to be. If the StringMap hash implementation ever changes, the necessary LLDB rebuild will detect this, the relevant LLDB parts will get adjusted and problem solved.

What I mean is if the cache is used across statically linked versions - eg: cache is created, someone installs an update to lldb, then the cache is read back and misinterprets the hashes in the cache if the hash algorithm had changed between versions.

That is not going to happen, that updated LLDB will be using a different cache version. That's what the part you're replying to means. You cannot have two LLDB builds using the same cache version but different hash implementations, as long as you do not ignore LLDB unittests failing.

Finally, there have been already attempts to change the hash function to use the better non-zero seed (D97396), and they were reverted because something depended on the hash function not changing, so I don't see it that likely for the hash function to change just like that in a minor update.

That something seems to have been another part of lldb - and that's the sort of change I'd like to enable/not make harder by avoiding more dependence on this ordering/hashing.

But if after all this that's still the theory, I guess StringMap could get an additional template parameter specifying the hash function, if it's actually worth the effort.

Yeah, a little traits class or the like is roughly what I'd picture.

^

^ I think it's still worthwhile/necessary to separate LLDB's use case/hashing algorithm choice from LLVM's so LLVM's code can be changed to be more change resilient in a way that LLDB's cannot (eg: random seeds will never be usable by LLDB but may be for LLVM).

Then I still don't know what the problem is supposed to be. If the StringMap hash implementation ever changes, the necessary LLDB rebuild will detect this, the relevant LLDB parts will get adjusted and problem solved.

What I mean is if the cache is used across statically linked versions - eg: cache is created, someone installs an update to lldb, then the cache is read back and misinterprets the hashes in the cache if the hash algorithm had changed between versions.

That is not going to happen, that updated LLDB will be using a different cache version. That's what the part you're replying to means. You cannot have two LLDB builds using the same cache version but different hash implementations, as long as you do not ignore LLDB unittests failing.

Fair enough - probably good to have some commentary in the test cases that makes it really clear that if the test needs to be updated then the version needs to be updated. (is that patch already posted? Could you link to that comment from this review?)

^ I think it's still worthwhile/necessary to separate LLDB's use case/hashing algorithm choice from LLVM's so LLVM's code can be changed to be more change resilient in a way that LLDB's cannot (eg: random seeds will never be usable by LLDB but may be for LLVM).

Possibly, but that can be done in another patch. Preferably by somebody who has an actual use case for it.

Fair enough - probably good to have some commentary in the test cases that makes it really clear that if the test needs to be updated then the version needs to be updated. (is that patch already posted? Could you link to that comment from this review?)

Already done. D124704 , https://reviews.llvm.org/D124704#change-yXGqjEgMMr0u .

dblaikie commandeered this revision.Tue, Dec 12, 2:59 PM
dblaikie closed this revision.
dblaikie edited reviewers, added: llunak; removed: dblaikie.