The CHashVisitor can be used to determine a unique hash for a translation unit. The hash is stable across compiler invocations and if two translation units have the same hash, the resulting object file is semantically equal (code, data, etc. are equal; debug information may be different).
Details
- Reviewers
- None
Diff Detail
- Repository
- rC Clang
- Build Status
Buildable 12712 Build 12712: arc lint + arc unit
Event Timeline
Please run Clang-format and Clang-tidy modernize over newly added code.
include/clang/AST/CHashVisitor.h | ||
---|---|---|
2 | Please loon onto other headers for inclusion guards style. | |
10 | No empty lines between header groups. | |
15 | Unnecessary empty line. | |
19 | Please add empty line below. | |
154 | } // namespace CHashConstants Please add empty line above. | |
160 | Please use using instead of typedef same in other places. | |
161 | Please add empty line above. | |
447 | Unnecessary lines. | |
451 | } // namespace clang | |
452 | #endif // <inclusion guard> |
We already have mechanisms to hash the AST. I'm strongly opposed to adding another one (and requiring AST modifications to update yet more such mechanisms).
Please look at the work that Richard Trieu has been doing recently to create stable-across-TUs hashes of statements and declarations, start a conversation on cfe-dev about integrating your approach and the existing Profile approach, and try to reach consensus on a way forward that doesn't leave us with multiple different hashing mechanisms. Replacing the Profile mechanism with a tablegen-driven one such as the one in this patch (and then generating the various hashing functions with their distinct requirements from that shared description) does seem like it could be a reasonable approach to me, but it needs to be a replacement rather than keeping both mechanisms around.
@Eugene.Zelenko Thank you for pointing me out on these issues. I ran clang-tidy and clang-format on CHashVisitor.h
@rsmith You're right, there is already more than one implemenation of {partial,unstable} AST hashes within LLVM, as we already discussed on cfe-dev in August[1]. Therefore, I rewrote our original CHash implementation to extend the already existing StmtDataCollectors approach. However, you are right, efforts should be coordinated to get a AST hashing implementation that can be adapted to various use-case scenarios.
[1] http://lists.llvm.org/pipermail/cfe-dev/2017-August/054911.html
I think it would be quite nice if we manage to consolidate the implementation of hashing mechanisms.
Modifications to StmtDataCollectors.td will alter the behaviour of clone detection. Currently, this file assumes that we want to find type 2 clones. So we should maintain that, and add things only where needed.
I submitted this patch so we can have different categories of data collection
https://reviews.llvm.org/D40781
So you can define a category for cHash and add let cHash = [{ .. }] or let cHash = TypeIIClone; if you just want to reuse it.
With this patch you can also use !codeconcat to append some code: https://reviews.llvm.org/D40782
For my changes to StmtDataCollectors: I tried to leave the already existing entries in StmtDataCollector.td untouched and the test-cases still work OK. If there is something somebody could break by changing these files it would be good to have a test case for it.
However, some of the collectors are not node local, which seems a little bit inconsequent to me. For example, DeclStmt includes information about the its Decl Children. In my opinion every entry in the *DataCollector.td should include only information that can be directly retrieved from this particular AST node. Cross-referencing between nodes should be done on the user side of the collectors (as it is done in CHashVisitor.h, which follows pointers in a cross-tree manner). With this separation of concerns it was quite easy for me to base CHash of the TableGen'ed headers. Furthermore, a purely syntactical hash (only the children, no cross-tree dependencies) could be derived easily.
@johannes For the different types of information that should be included for one AST node (e.g., Type II clones): We could use a hierachy of "What should be considered here". But we should clearly define what is the definition of equivalency for that level. For example, if you wake me up at night I could not define you the difference between a type I and a type II clone.
Yeah, the tests are not comprehensive unfortunately
However, some of the collectors are not node local, which seems a little bit inconsequent to me. For example, DeclStmt includes information about the its Decl Children. In my opinion every entry in the *DataCollector.td should include only information that can be directly retrieved from this particular AST node. Cross-referencing between nodes should be done on the user side of the collectors (as it is done in CHashVisitor.h, which follows pointers in a cross-tree manner). With this separation of concerns it was quite easy for me to base CHash of the TableGen'ed headers. Furthermore, a purely syntactical hash (only the children, no cross-tree dependencies) could be derived easily.
Good point, I think we can move the code for DeclStmt to CloneDetection.cpp
@johannes For the different types of information that should be included for one AST node (e.g., Type II clones): We could use a hierachy of "What should be considered here". But we should clearly define what is the definition of equivalency for that level. For example, if you wake me up at night I could not define you the difference between a type I and a type II clone.
Yeah, it's difficult to clearly define that though. The thing I noticed is that the function name is included for PredefinedExpr, but Type II clones do not care about values, but only structure. I think most of your additions should be fine , @teemperor could confirm them.
I removed the magic constants and added the number of children for various AST nodes. This should avoid hash collisions.
[CHash] Stable TU-level AST Hashing for local and global hashes
This patch does integrate the AST hashing parts of the CHash project[1][2] into
CLang. The extension provided by this patch accomplishes 2 things:
- local hashes: It extends StmtDataCollector infrastructure by hash rules for declarations, types, and attributes. These rules are expressed in terms of TableGen files and capture what parts of a AST node are important to its semantic. Thereby, the rules only include information that is local to the respecting node (no pointers to other AST nodes should be followed).
Since these DataCollectors do not impose rules how the AST nodes relate to each other, they provide a mechanism for "local" hashes.
- global hashes: The CHashVisitor combines the local hashes via the RecursiveAST Visitor into a global semantic hash that includes (for a given AST node) all local hashes of all AST nodes that can influence the compilation of this AST node.
With these global hashes, we can deduce (early) in the compilation process wether the result of this compiler run will produce the same object file. This mechanism is *NOT* part of this commit.
[1] https://www.sra.uni-hannover.de/Research/cHash/
[2] "cHash: Detection of Redunandt Recompilations via AST hashing", USENIX 2017
include/clang/AST/CHashVisitor.h | ||
---|---|---|
73–80 | Should these also be storing the hashes of the elements in addition to storing the size? | |
212 | I thought the idea was to have all the code in the *Collectors.td but here you have a bunch of Visit* functions for AST nodes. This seems counter to your point of having them generated. | |
213–214 | I don't see this example in the test. | |
227 | Can't you query the hash for the Decl instead of doing work here? | |
254 | Is this possible recursion included in your test? | |
256 | Shouldn't this be handled in the VisitFieldDecl function? | |
include/clang/AST/DeclDataCollectors.td | ||
6–8 | Why is this randomness needed? | |
10 | What do you mean by "CrossRef"? | |
61–62 | Does that mean you consider: enum E { E1, E2, E3 }; to be equivalent to: enum E { E1 = 0, E2 = 1, E3 = 2 }; ? |
I am not sure if this is the right time but maybe testing and comparing the behavior of this patch against something such as the ODRHash will help you, me (and probably others) to understand the current patch better.
In D41416, I started working on lazy deserialization of template specializations. When you produce a PCM or PCH and a template instantiation is required it'd currently deserialize eagerly all possible template specializations (and standard libraries have a lot of them). I am computing the hashes of the template arguments and comparing against the usual Profile-ed hashes. In the assert, essentially I am comparing the behavior of the ODRHash and the Stmt::Profile logic for some cases.
Perhaps you can plugin your algorithm to it, create a pch and start comparing your logic with the when it comes to template arguments hashing and alike.
Of course this is not something that will give you perfect coverage and so on but it can stress test the implementation and compare it to the 2 major existing ones in clang.
include/clang/AST/CHashVisitor.h | ||
---|---|---|
11 | I suspect this is a copy-paste from the APValue class. Could you update the documentation what the file contains? | |
292 | Clang uses /// for doxygen style documentation. |
include/clang/AST/CHashVisitor.h | ||
---|---|---|
73–80 | Ok. So this is called from
All of these occurrences are vistied by the RecursiveASTVisitor. However, we have no Idea how many children are visited by these for loops. Therefore, we additionally call addData(ArrayRef<...> x) to indicate give the actual user of StmtCollector* the possibility to add these lengths to the hash. This is also the reason, why we add only the std::distance to the length. I will add a comment to this. | |
212 | See Mail. | |
213–214 | It is now reflected in the tests. | |
227 | How would I do it. As the example shows, there is no hash for a, when we try to calculate the hash for a. However, I included a unittest to cover this. | |
254 | It now is. | |
256 | I must avoid to call addData(D->getType()) in VisitValueDecl for FieldDeclarations. So there would also be a conditional. | |
include/clang/AST/DeclDataCollectors.td | ||
6–8 | It is there for my personal feeling of less hash collisions. In former versions of this patch there was a large random constant for every type of AST node to decrease the likelyness of collisions when we left out some information. Small numbers (like enum values) often occur in source code. A 32 Bit Pseudo-Random value is unlikely to be seen in real source code. However, if you dislike its, I can remove the llvm::hash_value here. | |
10 | CrossRef indicates that the following fields are not local to the node. However, we give them to the user of the DataCollector to give it a chance to include the number of children into its hash. This directly relates to the addData(ArrayRef<...>) question above. |
Please loon onto other headers for inclusion guards style.