This is an archive of the discontinued LLVM Phabricator instance.

[index-while-building] IndexRecordHasher
AbandonedPublic

Authored by jkorous on Feb 27 2019, 4:39 PM.

Details

Summary

Another piece of index-while-building functionality.
RFC: http://lists.llvm.org/pipermail/cfe-dev/2019-February/061432.html

Originally part of review https://reviews.llvm.org/D39050

This implementation is covered by lit tests using it through c-index-test in upcoming patch. It's just split off to make the patch smaller and easier to review.

Diff Detail

Event Timeline

jkorous created this revision.Feb 27 2019, 4:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 27 2019, 4:39 PM

Adding clangd folks in case they want to take a look.

kadircet added inline comments.Mar 8 2019, 2:35 AM
clang/lib/Index/IndexRecordHasher.h
42

Why expose hashing functionality for other types?

I left some comments, but it is difficult for me to review without understanding what the requirements for this hasher are, why some information is hashed in, and some is left out, could you clarify? See detailed comments.

This implementation is covered by lit tests using it through c-index-test in upcoming patch.

This code looks very unit-testable. It also handles many corner cases (what information is hashed and what is left out). c-index-tests are integration tests that are not as good at that, and also they would be testing this code quite indirectly.

clang/lib/Index/IndexRecordHasher.cpp
175 ↗(On Diff #188646)

"caching all decls" => "caching hashes for all decls"

191 ↗(On Diff #188646)

I don't think that hiding a "CreateUnsafe" in an operation with a benign name "asCanon" is a good idea. Why not inline this lambda everywhere, it looks like it is defined to only save typing a few characters.

207 ↗(On Diff #188646)

What about other qualifiers?

Why not use Qualifiers::getAsOpaqueValue() instead of manually converting a subset of qualifiers to unsigned?

209 ↗(On Diff #188646)

Is this a FIXME or?..

215 ↗(On Diff #188646)

Why not hash in T->getTypeClass() uniformly for all types, instead of inventing a random sigil?

219 ↗(On Diff #188646)

There is LValueReferenceType and RValueReferenceType, do we care about the difference? If not, why not?

254 ↗(On Diff #188646)

What about all other types -- array type, function type, decltype(), ...?

What about attributes?

277 ↗(On Diff #188646)

I had to read the implementation of this function to understand what it does. How about renaming it to "cachedHash()" ?

291 ↗(On Diff #188646)

hashImpl => cachedHashImpl

409 ↗(On Diff #188646)

So what is the failure mode for unhandled types, what is the effect on the whole system?

476 ↗(On Diff #188646)

I'd suggest to remove this comment. It is more confusing than helpful. It makes the code look like there's some processing (like a "break" in other cases), but at a close inspection it turns out to be a comment. This kind of fallthrough is pretty common and I don't think it requires a comment. (For example, hashImpl above has this kind of fallthough, but does not have comments like this.)

clang/lib/Index/IndexRecordHasher.h
32

Could you add some explanation about the information that is being hashed? What is the underlying principle behind choosing to hash some aspect of the AST node (or to skip it). For example, I see you're hashing names, but not, say source locations. Or that QualTypes are first translated into canonical types.

What are the completeness constraints for this hasher? What happens if we don't hash something that we should have?

"Caching all produced hashes" seems like an implementation comment. Especially since the implementation chooses not to cache some of the hashes.

jkorous updated this revision to Diff 190317.Mar 12 2019, 1:24 PM

Based on Kadir comment I refactored the code.

jkorous marked an inline comment as done.Mar 12 2019, 1:25 PM
jkorous added a comment.EditedMar 12 2019, 1:31 PM

I left some comments, but it is difficult for me to review without understanding what the requirements for this hasher are, why some information is hashed in, and some is left out, could you clarify? See detailed comments.

Will do.

This implementation is covered by lit tests using it through c-index-test in upcoming patch.

This code looks very unit-testable. It also handles many corner cases (what information is hashed and what is left out). c-index-tests are integration tests that are not as good at that, and also they would be testing this code quite indirectly.

I basically didn't really like the idea of testing against hard-coded hash values in unittests as I consider it to be an implementation detail. I was thinking about integration tests that would work around this by both writing index and reading index and writing assertions against that data. What do you think?

I basically didn't really like the idea of testing against hard-coded hash values in unittests as I consider it to be an implementation detail. I was thinking about integration tests that would work around this by both writing index and reading index and writing assertions against that data. What do you think?

Instead of testing against hardcoded values maybe you can test for difference? Making sure hashes differ or stays same when appropriate changes are done to the contents?

I basically didn't really like the idea of testing against hard-coded hash values in unittests as I consider it to be an implementation detail.

Sorry, that's not what I was suggesting. There are better ways to test hashing. For example, write a piece of source code and check that the hash codes for two declarations that only differ in one aspect are same or different.

For example,

void f(int &&);
void f(int &);

and then assert that hashes for two decls are same or different depending on the desired specification.

I was thinking about integration tests that would work around this by both writing index and reading index and writing assertions against that data.

How is hashing used in this process?

jkorous marked 8 inline comments as done.Mar 12 2019, 2:14 PM

Addressed some comments, going to update the diff.

clang/lib/Index/IndexRecordHasher.cpp
291 ↗(On Diff #188646)

Honestly, I am not sure this would be better. I added a comment about this set of methods in the header file. Wouldn't mind renaming them but think hashImpl is actually quite accurate.

409 ↗(On Diff #188646)

Seems like just the InitialHash is returned at the moment.

I guess using llvm::Optional<hash_code> as a return type would be better. WDYT?

jkorous updated this revision to Diff 190342.Mar 12 2019, 2:14 PM
jkorous marked 2 inline comments as done.

Addressed some of Dmitri's comments.

I see what you mean now. That's a good idea. I'll add some unit tests.

We did performance tests of alternative approach - just hashing the serialized bit code representation. There's a performance regression in the sense that while the current implementation costs approx. extra 2.2% in build time the alternative approach costs 3.8%.

We are not happy about the regression but the best way to fix the current implementation seems to be using the alternative approach as a temporary solution. The plan is to move on with upstreaming the rest of the index-while-building so we can optimize the hasher after i-w-b lands - possibly using the ODR violation hasher.

This means the whole implementation will become just:

hash_code RecordHash = hash_combine_range(State.Buffer.begin(), State.Buffer.end());

where

SmallString<512> Buffer;

Due to necessary change in interface of IndexRecordWriter I'll abandon this patch and land the new hasher later.

jkorous abandoned this revision.Apr 12 2019, 2:42 PM