This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Make StmtDataCollector customizable
ClosedPublic

Authored by johannes on Aug 14 2017, 1:39 AM.

Details

Summary

This moves the data collection macro calls for Stmt nodes
to lib/AST/StmtDataCollectors.inc

Users can subclass ConstStmtVisitor and include StmtDataCollectors.inc
to define visitor methods for each Stmt subclass. This makes it also
possible to customize the visit methods as exemplified in
CloneDetection.cpp.

Move helper methods for data collection to a new module,
AST/DataCollection.

Add data collection for DeclRefExpr, MemberExpr and some literals, tested
in unittests/AST/DataCollectionTest.cpp.

Diff Detail

Repository
rL LLVM

Event Timeline

johannes created this revision.Aug 14 2017, 1:39 AM

Very well done, I really like this patch! I added a few remarks mostly about the comments that need some small adjusting.

I'm wondering what would be a nice way of creating a StmtDataCollector that is faster but only works for single translation units (e.g. it only hashes the pointer values of decls instead of their qualified name)? The use case here would be Stmt::Profile and the CloneDetector which could be set to a single-TU-mode in the CloneChecker. For Stmt::Profile it would ensure we don't degrade performance from the current version, for the CloneChecker we probably get a reasonable performance boost (as the hashing is currently the last remaining bottle neck).

@arphaman Any suggestions who could review/approve the additions to AST/?

include/clang/AST/DataCollection.h
19 ↗(On Diff #110920)

I think we need another namespace here because I don't like a generic function name like addDataToConsumer in the clang namespace. Maybe clang::data_collection::addDataToConsumer or something like that.

lib/AST/StmtDataCollectors.inc
5 ↗(On Diff #110920)

I know this is from my code, but as you're anyway touching this line, could you fix this comment? It should be "that non-macro-generated code` instead of that macro generated. Thanks!

lib/Analysis/CloneDetection.cpp
172 ↗(On Diff #110920)

I think this line can be removed now.

174 ↗(On Diff #110920)

what a code clone is -> what a type II code clone is

210 ↗(On Diff #110920)

Should be Type II clones ignore variable names and literals, so let's skip them

arphaman edited edge metadata.Aug 15 2017, 2:47 AM
arphaman added a subscriber: rsmith.

Very well done, I really like this patch! I added a few remarks mostly about the comments that need some small adjusting.

I'm wondering what would be a nice way of creating a StmtDataCollector that is faster but only works for single translation units (e.g. it only hashes the pointer values of decls instead of their qualified name)? The use case here would be Stmt::Profile and the CloneDetector which could be set to a single-TU-mode in the CloneChecker. For Stmt::Profile it would ensure we don't degrade performance from the current version, for the CloneChecker we probably get a reasonable performance boost (as the hashing is currently the last remaining bottle neck).

@arphaman Any suggestions who could review/approve the additions to AST/?

I guess @rsmith, but he's usually busy. I don't think this patch requires some special approval though.

This patch adds support for things like literals and DeclRefExpr and MemberExpr, but they're untested. They should either be added in a follow-up patch or tested with an appropriate unittest in unittests/AST or unittests/Tooling.

johannes updated this revision to Diff 111316.Aug 16 2017, 1:02 AM
johannes edited the summary of this revision. (Show Details)
johannes removed a subscriber: rsmith.
  • fix comments
  • add unittest for the new node kinds
  • make addData() take a const reference
johannes updated this revision to Diff 111317.Aug 16 2017, 1:05 AM
johannes added a subscriber: rsmith.

add

Very well done, I really like this patch! I added a few remarks mostly about the comments that need some small adjusting.

I'm wondering what would be a nice way of creating a StmtDataCollector that is faster but only works for single translation units (e.g. it only hashes the pointer values of decls instead of their qualified name)? The use case here would be Stmt::Profile and the CloneDetector which could be set to a single-TU-mode in the CloneChecker. For Stmt::Profile it would ensure we don't degrade performance from the current version, for the CloneChecker we probably get a reasonable performance boost (as the hashing is currently the last remaining bottle neck).

@arphaman Any suggestions who could review/approve the additions to AST/?

So we want at least two different modes

  • unstable / hash pointer members
  • stable / follow pointer members

We can do this by using two macros with different names. So we can split DEF_ADD_DATA in something like COLLECT_DATA_STABLE and COLLECT_DATA_UNSTABLE
Both would share the collection of POD members, so it makes sense to use a third macro for this.

So the code in StmtDataCollectors.inc assumes that DataCollection.h has been included, and that there is a addData() function as well as an ASTContext &Context in the current scope.
These things should be documented in DataCollection.h I guess.

Since StmtDataCollectors.inc resides in lib I have to use relative paths (so the include directive looks different depending on the current file), we have to live with this, right?

teemperor accepted this revision.Aug 23 2017, 2:01 AM

Sorry for the delay, was on vacation.

Since StmtDataCollectors.inc resides in lib I have to use relative paths (so the include directive looks different depending on the current file), we have to live with this, right?

That's right, but do we want this to be in lib/ ? There users can't access this code to build their own DataCollector (at least when they build against an installed clang version where this inc file doesn't get installed). And if someone uses this file from a clang header by accident we get errors about this missing file with installed clang versions (and I think the build bots don't perform install tests, so it probably breaks silently).

We probably also can't just move it into include/ because *.inc files there break some CMake sanity checks IIRC, so I think the best way forward is to tablegen this file which also makes the StmtDataCollectors.inc file more readable.

These things should be documented in DataCollection.h I guess.

Agreed, could you do that?

I think this can go in with just these documentation changes (especially since its GSoC related and we're in the last week), so please add them and then LGTM. Stable hashes and tablegen can be a follow-up patches.

This revision is now accepted and ready to land.Aug 23 2017, 2:01 AM
This revision was automatically updated to reflect the committed changes.