This is an archive of the discontinued LLVM Phabricator instance.

Added basic capabilities to detect source code clones.
ClosedPublic

Authored by teemperor on May 30 2016, 10:40 AM.

Details

Summary

This patch adds the CloneDetector class which allows searching source code for clones.

CloneDetector processes translation units and generates search data for each statement in a translation unit.
Afterwards, it can search for clones based on the generated data.

The way CloneDetector searches for clones relies on a locality-sensitive hash functions that accepts
clang::Stmt* as an input and outputs integer hash values. The generated hash values are generated
in a way that produces identical hash values for similar statements.

Before searching, CloneDetector hashes each statement and stores the hash value.
These hash values are then searched for identical values during the actual clone search and the
related statements are then assumed to be clones of each other.

This initial patch only provides a simple hashing mechanism that hashes the class type of each
statement. Hashing class specific data will be added in future patches.

Also, a simple checker is provided to test the CloneDetector class with normal regression test files.
This checker searches the current translation unit for clones using the CloneDetector class and reports them.

Diff Detail

Repository
rL LLVM

Event Timeline

teemperor updated this revision to Diff 58975.May 30 2016, 10:40 AM
teemperor retitled this revision from to Added ASTStructure for analyzing the structure of Stmts..
teemperor updated this object.
teemperor added reviewers: v.g.vassilev, zaks.anna.
teemperor added a subscriber: cfe-commits.
zaks.anna edited edge metadata.Jun 10 2016, 12:05 PM

Please, pull out the refactor into a separate commit and ask someone who often reviews SemaChecking review it.

lib/AST/CMakeLists.txt
10 ↗(On Diff #58975)

Do we want this code to live in the AST library? Who will be the users of this?

zaks.anna added a subscriber: dcoughlin.
v.g.vassilev requested changes to this revision.Jul 2 2016, 11:11 AM
v.g.vassilev edited edge metadata.
v.g.vassilev added inline comments.
include/clang/Basic/SourceManager.h
1173 ↗(On Diff #58975)

A leftover from something else?

1179 ↗(On Diff #58975)

Could you revise the grammar of this entire sentence?

lib/AST/ASTStructure.cpp
42 ↗(On Diff #58975)

You meant "now" instead of "know"?

88 ↗(On Diff #58975)

I'd add the SourceManager as a member and shorten this code snippet.

93 ↗(On Diff #58975)

You meant "as not to be"?

100 ↗(On Diff #58975)

Is that the LLVM/Clang common notion for documenting private members: (i.e. doxygen disabled) instead of /

146 ↗(On Diff #58975)

Why this is dangling? Also "reset".

202 ↗(On Diff #58975)

Maybe we should issue a warning about it.

366 ↗(On Diff #58975)

Could we document these "magic" prime numbers and move them in a central place. I think this would make it more readable.

404 ↗(On Diff #58975)

Could the implementation of GCCAsmStmt go here in the base class? And GCCAsmStmt to fall back to the base class (I assume this way we would have basic support for MSAsmStmt). This holds for the other cases of this pattern, eg.: *Literal.

unittests/AST/ASTStructureTest.cpp
46 ↗(On Diff #58975)

Why not passing just the ASTContext variable?

80 ↗(On Diff #58975)

Same copy-paste here ;) See above.

103 ↗(On Diff #58975)

I'd move this up and reuse the variables in the rhs of HashA and HashB.

130 ↗(On Diff #58975)

What we will get from compareStmt("if (1 == 2) {}", "if (false) {}")

This revision now requires changes to proceed.Jul 2 2016, 11:11 AM
teemperor updated this revision to Diff 62877.Jul 6 2016, 8:21 AM
teemperor retitled this revision from Added ASTStructure for analyzing the structure of Stmts. to Added basic capabilities to detect source code clones..
teemperor updated this object.
teemperor edited edge metadata.
  • Patch now only provides basic generic hashing and is smaller.
  • Code style is now respecting LLVM/clang guidelines.
  • Documented all non-trivial functions.
  • Moved testing from unittests to normal tests.
  • Added tests for false-positives.
  • Fixed the problems pointed out by Vassil (beside "Is that the LLVM/Clang common notion for documenting private members: (i.e. doxygen disabled) instead of /").
  • No longer patching SourceManager.h.
teemperor updated this revision to Diff 62888.Jul 6 2016, 9:35 AM
teemperor edited edge metadata.
  • Using doxygen-style comments for all private members.
teemperor updated this revision to Diff 62899.Jul 6 2016, 10:09 AM
  • Fixed a few typos in comments and documentation.
v.g.vassilev requested changes to this revision.Jul 7 2016, 2:10 AM
v.g.vassilev edited edge metadata.
v.g.vassilev added inline comments.
include/clang/AST/CloneDetection.h
148 ↗(On Diff #62899)

It might be a good idea to detect cloned comments, too.

153 ↗(On Diff #62899)

Can you move this on the previous line?

166 ↗(On Diff #62899)

This should go on prev line too. Please check all comments for this pattern.

215 ↗(On Diff #62899)

We shouldn't copy (or hope this would be std::moved). Can you pass it as a reference to the argument list?

lib/AST/CloneDetection.cpp
22 ↗(On Diff #62899)

It is a very common pattern in clang the ASTContext to be passed as ASTContext&, this should simplify the body of the constructors.

45 ↗(On Diff #62899)

I'd factor out the Context->getSourceManager() in a local variable.

54 ↗(On Diff #62899)

I am not sure what is the intent but shouldn't that be Stmt const* const*?

56 ↗(On Diff #62899)

Please use auto CS = cast<CompoundStmt>(S);

This logic will crash on void f() try {} catch(...){} In this case we do not generate a CompoundStmt.

64 ↗(On Diff #62899)

Same as above.

84 ↗(On Diff #62899)

Still a floating magic constant. Could you factor it out in a variable with a meaningful name?

112 ↗(On Diff #62899)

Doxygen style /// ?

139 ↗(On Diff #62899)

Could we use a range-based for loop. This looks odd.

161 ↗(On Diff #62899)

Indent.

164 ↗(On Diff #62899)

Could you move these members above. I don't think this is common in the codebase.

293 ↗(On Diff #62899)

Capital letters are more common when naming iterators. What about small i and j here. And down you can use I and you won't need to break the line.

This revision now requires changes to proceed.Jul 7 2016, 2:10 AM
teemperor updated this revision to Diff 63159.Jul 7 2016, 4:30 PM
teemperor edited edge metadata.
  • Fixed the problems pointed out by Vassil (Thanks!)
  • Comments now have less line breaks.
  • Added more documentation to HashValue, it's hash function and the used prime numbers.
  • Added a typedef for StmtSequence::iterator.
  • StmtSequence::end() is now always returning a correct iterator.
teemperor updated this revision to Diff 63161.Jul 7 2016, 4:37 PM
teemperor edited edge metadata.
teemperor marked 5 inline comments as done.
  • Fixed type of StmtSequence::iterator.
teemperor updated this revision to Diff 63262.Jul 8 2016, 11:12 AM
teemperor marked 7 inline comments as done.
NoQ added a subscriber: NoQ.Jul 8 2016, 11:25 AM
teemperor updated this revision to Diff 63266.Jul 8 2016, 11:34 AM
  • StmtSequence is now treating all Stmt objects as const.

Hi,

Thank you for working on this!

Here are several comments from a partial review.

Can you talk a bit about how do you envision to generalize this to copy and paste detection, where the pasted code is only partially modified? Do you think this could be generalized in that way?

include/clang/AST/CloneDetection.h
20 ↗(On Diff #63266)

Please, check if there are better data structures you could be using: http://llvm.org/docs/ProgrammersManual.html#map-like-containers-std-map-densemap-etc

131 ↗(On Diff #63266)

We are comparing the Stmt pointers here. Does this make results non-deterministic?

145 ↗(On Diff #63266)

Let's not talk about multiple TU here since much more is needed to support them. Maybe we should just say that, first, given a TU, the class generates the clone identifiers (hashes?) of statements in it (by running AnalyzeTranslationUnitDecl)?

The description of the method already states that it can be called multiple times.

161 ↗(On Diff #63266)

StmtData is too generic of a name. Maybe it could be called something like code signature, clone signature, or code hash?

166 ↗(On Diff #63266)

serves -> serve

186 ↗(On Diff #63266)

Could this return an llvm Optional instead of the error code?

201 ↗(On Diff #63266)

database -> map?

I would just say "Stores the StmtSequence with its associated data to allow future querying."

209 ↗(On Diff #63266)

Please, do not talk about "all translation units" here. Maybe say something like previously seen/stored statements/code or something like that.

217 ↗(On Diff #63266)

Make 10 into a constant. It probably should be a checker parameter.

include/clang/StaticAnalyzer/Checkers/Checkers.td
668 ↗(On Diff #63266)

This should be an "alpha" package until it's considered to be ready for end users.

lib/AST/CMakeLists.txt
10 ↗(On Diff #63266)

Again, I do not think this should be in the AST. Who the users of this will be? If you envision users outside of Clang Static Analyzer, maybe we could add this to lib/Analysis/.

lib/AST/CloneDetection.cpp
56 ↗(On Diff #63266)
103 ↗(On Diff #63266)

This should probably be a utility instead of something specific to this checker. Do we have something similar to this in LLVM already?

lib/StaticAnalyzer/Checkers/CloneChecker.cpp
45 ↗(On Diff #63266)

Is it possible to report these through BugReporter?

NoQ added a comment.Jul 11 2016, 10:32 AM

Your code is well-written and easy to understand, and makes me want to use it on my code! Added some minor comments, and there seems to be a small logic error in the compound statement hasher.

Not sure if that was already explained in detail, but i seem to agree that the only point for this project to integrate into Clang Static Analyzer is to integrate with the BugReporter facility - at least optionally: it would allow the reports of the clone detector checker to be gathered and viewed in a manner similar to other checkers - i.e. through scan-build/scan-view, probably CodeChecker and other tools. That should make the check easier to run and attract more users. However, the BugReporter mechanism is tweaked for the analyzer's path-sensitive checkers, so it'd need to be modified to suit your purpose, so in my opinion this is not critical for the initial commit.

lib/AST/CloneDetection.cpp
52 ↗(On Diff #63266)

Perhaps we could early-return when we see that StartIsInBounds is false (for performance, because isBeforeInTranslationUnit looks heavy).

88 ↗(On Diff #63266)

It seems that the last index is off-by-one, i think you meant something like:

H(D) = PrimeFactor^(N-1) * D[0] + PrimeFactor^(N-2) * D[1] + ... + D[N-1]
175 ↗(On Diff #63266)

Shouldn't it say something like "EndIterator = StartIterator"? Because when StartPos is non-zero, in your code we get [StartPos, Length) instead of [StartPos, StartPos + Length). In your code, Length acts as some kind of EndPos instead.

194 ↗(On Diff #63266)

Simplifies to:

StmtSequence ChildSequence(Child, Context);
238 ↗(On Diff #63266)

I think that the code that deals with computing data for sections of compound statements (as opposed to saving this data) should be moved to VisitStmt(). Or, even better, to VisitCompoundStmt() - that's the whole point of having a visitor, after all. Upon expanding the complexity of the hash, you'd probably provide more special cases for special statement classes, which would all sit in their own Visit method.

That's for the future though, not really critical.

262 ↗(On Diff #63266)

This code is duplicated from the beginning of the function, which synergizes with my point above: if SaveData() contained only that much, then it could have been re-used on both call sites.

lib/StaticAnalyzer/Checkers/CloneChecker.cpp
40 ↗(On Diff #63266)

TU->getASTContext().getTranslationUnitDecl() is always equal to TU - there should, in most cases, be only one TranslationUnitDecl object around, but even if there'd be more some day, i think this invariant would still hold.

43 ↗(On Diff #63266)

An idea: because this function optionally accepts MinGroupComplexity, you may probably want to expose this feature as a checker-specific option - see MallocChecker::IsOptimistic as an example, shouldn't be hard.

45 ↗(On Diff #63266)

I think that'd require changes in the BugReporter to allow easily adding extra pieces to path-insensitive reports (which is quite a wanted feature in many AST-based checkers - assuming we want more AST-based checkers to be moved into the analyzer specifically for better reporting, integration with scan-build, and stuff like that).

test/Analysis/copypaste/test-min-max.cpp
39 ↗(On Diff #63266)

Perhaps add a // no-warning comment to spots that should not cause a warning? This comment has no effect on the test suite, just a traditional way to notify the reader.

teemperor updated this revision to Diff 63851.Jul 13 2016, 1:39 PM
teemperor marked 18 inline comments as done.
  • Checker is now in the alpha package and hidden.
  • MinGroupComplexity is now a parameter for the checker.
  • StmtData is now called CloneSignature.
  • Replaced the std::map inside CloneDetector with a vector and a small cache in the HashVisitor.
  • Moved code from AST to Analysis.
  • Fixed multiple other smaller problems pointed out in the review. (Thanks, Vassil, Anna and Artem!)
teemperor added inline comments.Jul 13 2016, 1:45 PM
include/clang/AST/CloneDetection.h
131 ↗(On Diff #63266)

Good point. The updated patch only sorts by deterministic hash values which should fix this.

lib/AST/CMakeLists.txt
10 ↗(On Diff #63266)

Oh, sorry! I wanted to bring this up in a meeting but it seems I've forgot to do so.

The point with this being in AST is that we might want to merge the Stmt::Profile implementation with the implementation of CloneDtection. And Stmt::Profile is a part of AST AFAIK, so I think we need to stay in the same library? I don't have a good overview over the clang build infrastructure yet, so I would appreciate opinions on that.

If that's something we will handle later, what would be a better place for the code?

lib/AST/CloneDetection.cpp
103 ↗(On Diff #63266)

Yes, there is similar builtin hashing available in llvm::hashing, but the API is designed to work with iterator pairs. So we either have to record all the data in a vector and hash that (which is what FoldingSetNodeID does) or use code that isn't exposed in the API (llvm::hashing::detail::hash_state). The first option creates unnecessary overhead and the second option would require either patching LLVM or using code that isn't part of the public API.

I updated the patch with something that uses FoldingSetNodeID for now. I'll update it later if we can land a patch in LLVM that adds an API for hashing without recording all necessary data beforehand.

NoQ added a comment.Jul 22 2016, 1:37 AM

Regarding the cache stack - it feels easier for me to allocate a separate stack for each statement, and put the stack on stack (!) rather than having it global. This way it'd be automatically cleaned for you when VisitStmt() exits, and you'd be able to address child cache by index rather then through scanning.

lib/Analysis/CloneDetection.cpp
142 ↗(On Diff #63851)

This calculation happens in linear time and each statement is only visited a
fixed amount of times during this process.

Isn't SaveAllSubSequences() quadratic?

424 ↗(On Diff #63851)

I suspect that all i's that make it into the IndexesToRemove vector are already unique and increasing, by construction. We take some i, see if we need to remove it, then push it at most once and take (i + 1).

v.g.vassilev accepted this revision.Jul 22 2016, 12:31 PM
v.g.vassilev edited edge metadata.

LGTM, given the comments are addressed.

include/clang/Analysis/CloneDetection.h
37 ↗(On Diff #63851)

It is more common in the codebase to use Did you mean const Stmt *S. Could you update all similar occurrences to it?

lib/Analysis/CloneDetection.cpp
119 ↗(On Diff #63851)

Same here const StmtSequence &S is the common style in the codebase.

This revision is now accepted and ready to land.Jul 22 2016, 12:31 PM
teemperor updated this revision to Diff 65291.Jul 24 2016, 8:42 PM
teemperor edited edge metadata.

Ok, so I think I've addressed the points from last meeting in this patch:

It was planned to replace custom hashing with FoldingSetNodeID and a hashmap: In this patch I removed all custom hashing. It's now done via LLVM's StringMap and strings which brings the same functionality. Using FoldingSetNodeID is not possible because it hides it's internal data vector which StringMap doesn't support, so I've just used a normal vector instead.

It was also planned to refactor CloneSignatureCache: In this patch I've completely removed the cache because it is no longer necessary. Fixing it was not an option because we assumed that the statements RecursiveASTVisitor visits are the same that Stmt::children() delivers, which is for example for LambdaExprs not true (and RecursiveASTVisitor makes no promise about that in it's API). The only sensible fix for this was to use Artem's initial suggestion to not use RecursiveASTVisitor to traverse statements.

I've also adressed Artems and Vassils comments regarding documentation and code style in this patch (Thanks!).

teemperor marked 9 inline comments as done.Jul 24 2016, 8:46 PM
teemperor added inline comments.
include/clang/AST/CloneDetection.h
149 ↗(On Diff #62877)

Put the idea on the future TODO list, but we probably can't reuse a big part of the current infrastructure for it.

NoQ added a comment.Jul 25 2016, 6:46 AM

The idea with strings as keys is curious! I've got a few more minor comments :)

lib/Analysis/CloneDetection.cpp
104 ↗(On Diff #65291)

Typo: statement.

147 ↗(On Diff #65291)

Maybe pass ChildSignatures by reference?

178 ↗(On Diff #65291)

You'd probably want to add ObjCMethodDecl and BlockDecl, because they aren't sub-classes of FunctionDecl (and probably even tests for that).

Because this part of the code essentially re-implements AnalysisConsumer, (a RecursiveASTVisitor that tells us what code bodies to analyze with the static analyzer - i should've looked there earlier!!).

Alternatively, you can just rely on AnalysisConsumer, which would eliminate the recursive visitor completely:

This way you'd be analyzing exactly the same code bodies that the analyzer itself would analyze; if you'd want to extend to various declarations, you'd be able to do that by subscribing on check::ASTDecl. But i dare not to predict what kind of different flexibility you'd need next.

188 ↗(On Diff #65291)

Micro-nit: Capitalize 'V' in 'visitor'?

teemperor updated this revision to Diff 65439.Jul 25 2016, 4:36 PM
teemperor marked 7 inline comments as done.
  • Fixed the minor problems as pointed out by Artem.
  • Now using AnalysisConsumer instead of RecursiveASTVisitor.
  • Function names are now confirming to LLVM code-style.
  • Renamed DataCollectVisitor to CloneSignatureGenerator (as it's no longer following the Visitor pattern).
  • Added tests for clone detection in ObjC-methods and blocks.
  • Moved false-positive tests into own file.
teemperor updated this revision to Diff 65441.Jul 25 2016, 4:55 PM
  • The CloneGroup values in StringMap no longer store a copy of their own key.
teemperor added inline comments.Jul 25 2016, 4:58 PM
lib/Analysis/CloneDetection.cpp
178 ↗(On Diff #65291)

Thanks for the patch!

teemperor marked 3 inline comments as done.Jul 25 2016, 4:59 PM
NoQ added inline comments.Jul 26 2016, 5:58 AM
lib/Analysis/CloneDetection.cpp
148 ↗(On Diff #65441)

I've a feeling this comment was forgotten :o

teemperor updated this revision to Diff 65506.Jul 26 2016, 6:48 AM
  • Now passing ChildSignatures by const reference.
NoQ accepted this revision.Jul 26 2016, 10:44 AM
NoQ added a reviewer: NoQ.

Great! Will commit.

This revision was automatically updated to reflect the committed changes.