This is an archive of the discontinued LLVM Phabricator instance.

[NFC] Introduce ASTContext::isInSameModule()
Changes PlannedPublic

Authored by ChuanqiXu on Jul 31 2022, 8:18 PM.

Details

Summary

Previously, when we want to compare if two module units are in the same module, we need to compare their primary module name, which is string comparing. String comparing is expensive and we want to avoid it. So here is the patch. This one should be a NFC patch.

Diff Detail

Event Timeline

ChuanqiXu created this revision.Jul 31 2022, 8:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2022, 8:18 PM
ChuanqiXu requested review of this revision.Jul 31 2022, 8:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2022, 8:18 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
erichkeane added inline comments.Aug 1 2022, 6:38 AM
clang/lib/AST/ASTContext.cpp
6217

Please find another way to write this... this is about the most jarring way to compare these...

perhaps you can take advantage of the fact that the compare on 6222 basically does the other half to this:

if (!X || !Y)

replaces the next compare as well.

6232

Doesn't FindAndConstruct do something pretty similar here? This lambda is incredibly inefficient as it requires two 3 separate lookups into the map.

Instead, I believe FindAndConstruct will create a new entry (which can be 'set' if necessary', and then have its value returned.

ChuanqiXu updated this revision to Diff 449173.Aug 1 2022, 8:05 PM

Address comments.

ChuanqiXu marked 2 inline comments as done.Aug 1 2022, 8:07 PM
ChuanqiXu added inline comments.
clang/lib/AST/ASTContext.cpp
6232

On the one hand, FindAndConstruct would construct an empty value, which is not we want. We want the value to be a hash_value of the name. On the other hand, your comments makes sense the implementation could be optimized. Currently, there would be at most 2 lookups and the number should be 1 at the hot path.

Thanks for working on this. I have a few major concerns here:

  • Do we know that it's a bottleneck in practice? E.g. if the module name have different sizes, comparing strings is actually fast. If the module names are short, we are also in a pretty good situation. Having some benchmarks numbers would help to quantify whether the change is needed.
  • llvm::hash_value is not a crypto hash, so we should expect collisions from it. OTOH, doing an extra string comparison when hashes match defeats the purpose of this optimization. Collisions are rare, but if someone hits one after the compiler is released there are literally no workarounds. Instead we can use llvm::SHA1 or llvm::SHA256. (possibly truncated to 128 bits? it's hard to tell how much this is needed without benchmarks.)
  • Why not store the hash directly inside Module and compute it on construction after setting name? If we want these comparisons to be fast, we probably want to avoid even the overhead of hash table access.
ChuanqiXu planned changes to this revision.Aug 2 2022, 2:30 AM
ChuanqiXu marked an inline comment as done.

Thanks for working on this. I have a few major concerns here:

  • Do we know that it's a bottleneck in practice? E.g. if the module name have different sizes, comparing strings is actually fast. If the module names are short, we are also in a pretty good situation. Having some benchmarks numbers would help to quantify whether the change is needed.

No, we don't know it is a bottleneck in practice. (Currently there shouldn't be many users for C++20 Modules). The reason to do this is that we (Iain and me) feel bad to compare string frequently. Benchmarks would be good but I am afraid I can't. Since there are not enough existing benchmarks and it looks costful to construct the meaningful benchmarks. Let's not make the perfect to be the enemy of better.

  • llvm::hash_value is not a crypto hash, so we should expect collisions from it. OTOH, doing an extra string comparison when hashes match defeats the purpose of this optimization. Collisions are rare, but if someone hits one after the compiler is released there are literally no workarounds. Instead we can use llvm::SHA1 or llvm::SHA256. (possibly truncated to 128 bits? it's hard to tell how much this is needed without benchmarks.)
  • Why not store the hash directly inside Module and compute it on construction after setting name? If we want these comparisons to be fast, we probably want to avoid even the overhead of hash table access.

Good idea. Will try to do.

iains added a comment.EditedAug 2 2022, 2:38 AM

Thanks for working on this. I have a few major concerns here:

  • Do we know that it's a bottleneck in practice? E.g. if the module name have different sizes, comparing strings is actually fast. If the module names are short, we are also in a pretty good situation. Having some benchmarks numbers would help to quantify whether the change is needed.

No, we don't know it is a bottleneck in practice. (Currently there shouldn't be many users for C++20 Modules). The reason to do this is that we (Iain and me) feel bad to compare string frequently. Benchmarks would be good but I am afraid I can't. Since there are not enough existing benchmarks and it looks costful to construct the meaningful benchmarks. Let's not make the perfect to be the enemy of better.

We do not have benchmarks, it is true - initially, it seemed that the problem would be confined to the case of checking for the parent of a module partition which would be no concern - since it is one check per TU). However, in later parts of the implementation it has become necessary to check when merging or comparing individual decls - and that is a concern.

  • llvm::hash_value is not a crypto hash, so we should expect collisions from it. OTOH, doing an extra string comparison when hashes match defeats the purpose of this optimization. Collisions are rare, but if someone hits one after the compiler is released there are literally no workarounds. Instead we can use llvm::SHA1 or llvm::SHA256. (possibly truncated to 128 bits? it's hard to tell how much this is needed without benchmarks.)
  • Why not store the hash directly inside Module and compute it on construction after setting name? If we want these comparisons to be fast, we probably want to avoid even the overhead of hash table access.

Good idea. Will try to do.

+1.
if the string is stored locally to the Module, then would it not be sufficient to compare the address?
edit: ignore the second, that will not work.