This is an archive of the discontinued LLVM Phabricator instance.

New ODR checker for modules
ClosedPublic

Authored by rtrieu on Jun 23 2016, 10:39 PM.

Details

Reviewers
dblaikie
rsmith
Commits
rG93772fcfc7ca: [ODRHash] Add handling of bitfields
rG8459ddf12a88: [ODRHash] Add handling of TypedefType and DeclarationName
rGbcaaf96674ce: [ODRHash] Handle types in ODR hashing.
rGd0786099b189: [ODRHash] Add IdentiferInfo and FieldDecl support.
rG639d7b68d656: [ODRHash] static_cast and Stmt hashing.
rGe7f7ed2be782: Add more ODR checking.
rGb6adf54204b9: Part of adding an improved ODR checker.
rGcb6b72628e63: Add better ODR checking for modules.
rGf351ac8987fe: Add better ODR checking for modules.
rGfa3d93a148d4: Add better ODR checking for modules.
rC296170: [ODRHash] Add handling of bitfields
rC296078: [ODRHash] Add handling of TypedefType and DeclarationName
rC295931: [ODRHash] Handle types in ODR hashing.
rC295911: [ODRHash] Add IdentiferInfo and FieldDecl support.
rC295890: [ODRHash] static_cast and Stmt hashing.
rC295800: Add more ODR checking.
rC295533: Part of adding an improved ODR checker.
rC295421: Add better ODR checking for modules.
rC295284: Add better ODR checking for modules.
rC293585: Add better ODR checking for modules.
rL296170: [ODRHash] Add handling of bitfields
rL296078: [ODRHash] Add handling of TypedefType and DeclarationName
rL295931: [ODRHash] Handle types in ODR hashing.
rL295911: [ODRHash] Add IdentiferInfo and FieldDecl support.
rL295890: [ODRHash] static_cast and Stmt hashing.
rL295800: Add more ODR checking.
rL295533: Part of adding an improved ODR checker.
rL295421: Add better ODR checking for modules.
rL295284: Add better ODR checking for modules.
rL293585: Add better ODR checking for modules.
Summary

Early prototype of an improved ODR checker for Clang and modules. The current ODR checking of classes is limited to a small number of attributes such as number of methods and base classes. This still allows a number of ODR violations to pass through undetected which only manifests as problems during linking. This improved ODR checker calculates a hash based on the class contents, then checks that the two decls have the same hash before allowing them to be merged. The test case shows a number of ODR violations that the current Clang checker would not have caught.

The hashing relies on three visitors for Stmt's, Decl's, and Type's. For Stmt's, the visitor behind Stmt::Profile was refactored so that a hash could be generated without depending on pointers. For Decl's, a new DeclVisitor was created. The Type visitor has not been written yet. Instead, Types are converted into a string as a stand-in.

The other area for improvement is the diagnostic message. Most have the default message stating the class has different definitions in different modules. New specific messages will need to be created to supplement the default message.

Diff Detail

Event Timeline

rtrieu updated this revision to Diff 61756.Jun 23 2016, 10:39 PM
rtrieu retitled this revision from to New ODR checker for modules.
rtrieu updated this object.
rtrieu added a subscriber: cfe-commits.
rtrieu updated this revision to Diff 68278.Aug 16 2016, 8:02 PM

Add function void ODRHash(llvm::FoldingSetNodeID &ID) to several classes for computing the hash. Decl, Stmt, TemplateArgument, Type and QualType now have this function, and can call among each others' functions.

dblaikie added inline comments.Aug 17 2016, 9:53 AM
lib/AST/DeclBase.cpp
1810–1812 ↗(On Diff #68278)

Inconsistent {} on single line block (in VisitEnumConstantDecl above {} are not used on a single line block) - usually drop the {} on single line blocks.

(several other instances in this patch)

test/Modules/Inputs/odr_hash/first.h
6–9

It might make this test more readable if all the types were in one file, maybe like this:

#ifdef FIRST
  struct S2 { friend S2Friend1 };
#endif
  struct S2 { friend S2Friend2 };
#endif

Etc... - and it could just be a textual header that the two headers include (one header #defines/#undefs the appropriate thing and one doesn't). But maybe that's too complicated - I'm not sure.

bruno added a subscriber: bruno.Aug 30 2016, 11:19 AM
rtrieu updated this revision to Diff 73388.Oct 3 2016, 7:32 PM

Add a more detailed error message to let users know where the two records differ. This replaces the generic error which only stated that two definitions are different without any details.

rtrieu marked 2 inline comments as done.Oct 3 2016, 7:36 PM
rtrieu added inline comments.
lib/AST/DeclBase.cpp
1810–1812 ↗(On Diff #68278)

These should be more consistent now.

test/Modules/Inputs/odr_hash/first.h
6–9

Modified so entire test is now in one file. The separate headers and module maps are generated into the temp folder for each run.

rsmith edited edge metadata.Oct 3 2016, 8:26 PM

There are a bunch of cases here that do this:

if (auto t = getThing())
  ID.addThing(t);
if (auto t = getOtherThing())
  ID.addThing(t);

That will result in hash collisions between objects with thing and objects with otherthing (for instance, struct A { int n : 1; }; and struct A { int n = 1; }; appear to hash the same).

And there are some cases where you add a list of objects without first adding the size, which is liable to allow collisions.

I'm not too worried about these cases, since this is just a hash anyway, and is only a best-effort attempt to check for ODR violations, but some of them look like they'd be easy enough to fix, so I figure why not :)

Anyway, this looks really good. Do you know if the computation of the ODR hash has any measurable performance impact when building a large module? I'm not really expecting one, but it seems worth checking just in case.

lib/AST/DeclBase.cpp
1827 ↗(On Diff #73388)

You should not include the default argument if it was inherited from a previous declaration. That can happen in a friend declaration:

// module 1
void f(int = 0);
struct X { friend void f(int); }; // has (inherited) default arg
// module 2
struct X { friend void f(int); }; // has no default arg
rtrieu updated this revision to Diff 75970.Oct 26 2016, 5:07 PM
rtrieu edited edge metadata.
rtrieu marked 2 inline comments as done.

There are a bunch of cases here that do this:

if (auto t = getThing())
  ID.addThing(t);
if (auto t = getOtherThing())
  ID.addThing(t);

That will result in hash collisions between objects with thing and objects with otherthing (for instance, struct A { int n : 1; }; and struct A { int n = 1; }; appear to hash the same).

And there are some cases where you add a list of objects without first adding the size, which is liable to allow collisions.

I'm not too worried about these cases, since this is just a hash anyway, and is only a best-effort attempt to check for ODR violations, but some of them look like they'd be easy enough to fix, so I figure why not :)

The hashing now does:

auto t = getThing();
ID.addBoolean(t);
if (t)
  ID.addThing(t);

The added Boolean value prevents the hash collision.

Anyway, this looks really good. Do you know if the computation of the ODR hash has any measurable performance impact when building a large module? I'm not really expecting one, but it seems worth checking just in case.

I ran some comparisons and did not see any performance impact.

lib/AST/DeclBase.cpp
1827 ↗(On Diff #73388)

The visitor for friend decl only uses the type (void (int)) and name (f) of the friend. It doesn't process the type, so default arguments doesn't matter for this case.

rtrieu updated this revision to Diff 82702.Dec 29 2016, 4:49 PM

This is a redesign of the ODR checker which was discovered to have several shortcomings when run over test cases. The old version mainly used depth-first processing of AST nodes to gather the information to be hashed. In some instances, a recursive loop developed preventing the termination of the hashing function.

This version of the ODR checker will queue Type and Decl pointers into a vector. When the hashing function needs to refer to the pointers, it will use the index of the pointers' location in the vector instead. After any in progress hashing is finished, unprocessed pointers in the vector will be processed. Other AST nodes, such as Stmt's and TemplateArgument's are processed when received.

The design change also necessitated creation of an ODRHash class to manage the queued nodes. This is in ODRHash.{cpp,h} and most of the hashing logic is also moved into those files. Only the hashing for Stmt is left in StmtProfile.h since it shares code with Stmt::Profile.

Thanks! I assume there's still no measurable performance impact?

include/clang/AST/ODRHash.h
54 ↗(On Diff #82702)

Is this always the same as Pointers.size()?

lib/AST/ODRHash.cpp
35 ↗(On Diff #82702)

AddBoolean for clarity maybe?

335–337 ↗(On Diff #82702)

You will presumably need to filter out the implicit declarations before you emit the size of the list, otherwise a DeclContext with an implicit declaration will hash differently from one without.

493–495 ↗(On Diff #82702)

Given that AddStmt already writes out an "is null" flag, could you use AddStmt(hasDefaultArgument ? D->getDefaultArgument() : nullptr) here?

lib/Serialization/ASTReader.cpp
9015–9027 ↗(On Diff #82702)

Can you factor this out into a local lambda or helper class? These dozen lines are repeated quite a lot of times here with only small variations.

rtrieu updated this revision to Diff 84340.Jan 13 2017, 11:35 AM
rtrieu marked 5 inline comments as done.Jan 13 2017, 11:38 AM

Comments have been addressed. I will be testing it for performance impact next.

include/clang/AST/ODRHash.h
54 ↗(On Diff #82702)

Yes it is. It has been removed and Pointers.size() is used now.

rsmith accepted this revision.Jan 13 2017, 12:01 PM
rsmith edited edge metadata.

Thanks, looks good assuming your performance testing doesn't uncover anything.

lib/AST/ODRHash.cpp
319–321 ↗(On Diff #84340)

I think you can remove these lines: no-one should be calling this function with a null declaration or an implicit declaration.

This revision is now accepted and ready to land.Jan 13 2017, 12:01 PM
rtrieu marked an inline comment as done.Jan 13 2017, 2:27 PM

From testing, there is no difference when compiling with pre-compiled headers. However, when building the headers, there is a 3-4% impact on compile time.

After changing the ODRHash class a bit, the new performance numbers are 3% in debug and 1-1.5% in release builds.

I feel like we have a really similar use case in the Analysis/CloneDetection{.h/.cpp} with the hashing of statements. I tried substituting the call to the statement hashing with a call to the CloneDetection API and it seems that most tests pass and the remaining fails are just minor fixable differences (like ::foo() and foo() having different hash codes).

Would be nice if we could make Stmt::Profile, ODRHash and the CloneDetection use the same backend. We improved a few things that we no longer have the periodic reallocs from the vector inside NodeIDSet and that we use MD5. Also there are are some future plans on how we can better prevent regressions when we add/extend AST nodes. Thoughts?

lib/AST/ODRHash.cpp
11 ↗(On Diff #84340)

"Instruction class" -> "ODRHash class"

I feel like we have a really similar use case in the Analysis/CloneDetection{.h/.cpp} with the hashing of statements. I tried substituting the call to the statement hashing with a call to the CloneDetection API and it seems that most tests pass and the remaining fails are just minor fixable differences (like ::foo() and foo() having different hash codes).

Would be nice if we could make Stmt::Profile, ODRHash and the CloneDetection use the same backend. We improved a few things that we no longer have the periodic reallocs from the vector inside NodeIDSet and that we use MD5. Also there are are some future plans on how we can better prevent regressions when we add/extend AST nodes. Thoughts?

It would help to understand your use case better. Does CloneDetection care about ::foo versus foo? How about different types with the same name? ODR checking assumes identical token sequences, so it does care about extra "::". ODR checking also will process information about type while CloneDetection looks like it only uses the type name.

I see that CloneDetector uses both llvm::MD5 and llvm::FoldingSetNodeID, and only adds data via StringRef. MD5 will add the bytes as data, while FoldingSetNodeID will add the size of the string, then the bytes as data. This means MD5 may have collisions when two strings are added back to back while FoldingSetNodeID will store extra data for every integer processed. FoldingSetNodeID doesn't have this problem if AddInteger is used. Were the reallocs a big problem for CloneDetection?

I don't thinking merging Stmt::Profile, ODRHash, and CloneDetection would be a good idea unless the hashes needed are very similar. Stmt::Profile and ODRHash already share a base for Stmt processing, which might be extendable to CloneDetection as well, but a full merge may be difficult.

Sadly, we don't have a good story for when an AST node gets changed with new properties. The Stmt profiler does declare all the visit methods in the class definition, so that will catch any new Stmt nodes without a new visit method.

Would be nice if we could make Stmt::Profile, ODRHash and the CloneDetection use the same backend.

This code is *already* reusing the Stmt::Profile code for hashing of statements. Why was a different mechanism invented for CloneDetection? If it doesn't have different requirements, maybe it should be rewritten in terms of the Stmt::Profile implementation too.

davide added a subscriber: davide.Jan 26 2017, 8:16 PM
rtrieu updated this revision to Diff 86142.Jan 27 2017, 3:57 PM

Changes made to the ODR hash algorithm:

Separated Decl and Type pointers into their own DenseMap's.
Removed the queue of pointers to process at the end. Instead, process pointers at first use.
Save Boolean values and add them together at the end to reduce bits wasted. Shrinks data from bools 32x.

With these changes, the overhead is now 1-1.5%

This revision was automatically updated to reflect the committed changes.

This patch will be landing in small chunks to hopefully avoid the large reverts.