This is an archive of the discontinued LLVM Phabricator instance.

CloneDetection now respects statement specific data when searching for clones.
ClosedPublic

Authored by teemperor on Jul 19 2016, 7:24 AM.

Details

Summary

So far the CloneDetector only respected the class of each statement when searching for clones. This means that nodes that differ in any other attribute are falsely considered to be clones of each other as long as they had the same statement class. As an example, the statements a > b and a < b are currently considered to be clones of each other, even though they are obviously different.

This patch refines the way the CloneDetector collects data from each statement by providing methods for each class that will read the class-specific
attributes. These attributes are for example the operator kind for BinaryOperator statements which would prevent the false-positive result from the example above.

Diff Detail

Repository
rL LLVM

Event Timeline

teemperor updated this revision to Diff 64491.Jul 19 2016, 7:24 AM
teemperor retitled this revision from to CloneDetection now respects statement specific data when searching for clones..
teemperor updated this object.
teemperor added reviewers: v.g.vassilev, zaks.anna, NoQ.
teemperor added a subscriber: cfe-commits.
teemperor planned changes to this revision.Jul 19 2016, 7:27 AM
  • Expand test suite to test newly added code.
v.g.vassilev requested changes to this revision.Jul 20 2016, 5:37 AM
v.g.vassilev edited edge metadata.

I guess the question about the binary size holds here, too. What would be the impact on the binary size?

lib/Analysis/CloneDetection.cpp
104 ↗(On Diff #64491)

duplicate "for example".

145 ↗(On Diff #64491)

Why do we need this? Wouldn't callAddDataStmt properly dispatch to the concrete method call?

306 ↗(On Diff #64491)

In which case this happens?

324 ↗(On Diff #64491)

Could you prefix this as // FIXME:

teemperor updated this revision to Diff 64982.Jul 21 2016, 3:50 PM
teemperor edited edge metadata.
teemperor marked 3 inline comments as done.
  • Added FIXME: and removed duplicate "for example".
teemperor updated this revision to Diff 64983.Jul 21 2016, 3:53 PM
teemperor edited edge metadata.
  • Replaced visitor-style utility functions with chained ifs to reduce binary size.

Binary size increases from 33017160 Bytes to 33060464 Bytes (+40 KiB).

I'm not sure if that's too much for such a minor feature, so I've added two new revisions:

  • One is the original patch with the other mentioned issues fixed (so, same +40 KiB size increase here).
  • The other uses only one collect function which reduces the binary size increase to +2 KiB (but uses chained if's with dyn_cast).

The problem is that the LLVM coding standard isn't a big fan chained ifs with dyn_cast and recommends doing it in the same way as the original patch for performance reasons.

lib/Analysis/CloneDetection.cpp
145 ↗(On Diff #64491)

callAddDataFoo would call all addDataXXX functions for Foo and all parent classes of Foo. So callAddDataStmt would only call addDataStmt because Stmt has no parent class. callAddCompoundStmt would call addDataStmt (through callAddDataStmt) and addDataCompoundStmt and so on.

306 ↗(On Diff #64491)

For example

if (a < b) return 1;

would have two nullptr children in the AST (the missing 'init' and 'else' Stmts):

`-IfStmt 0x2d595f8 <line:3:3, col:22>
  |-<<<NULL>>>
  |-BinaryOperator 0x2d59598 <col:8, col:12> '_Bool' '>'
  | |-ImplicitCastExpr 0x2d59568 <col:8> 'int' <LValueToRValue>
  | | `-DeclRefExpr 0x2d59518 <col:8> 'int' lvalue Var 0x2d59418 'a' 'int'
  | `-ImplicitCastExpr 0x2d59580 <col:12> 'int' <LValueToRValue>
  |   `-DeclRefExpr 0x2d59540 <col:12> 'int' lvalue Var 0x2d59488 'b' 'int'
  |-ReturnStmt 0x2d595e0 <col:15, col:22>
  | `-IntegerLiteral 0x2d595c0 <col:22> 'int' 1
  `-<<<NULL>>>
NoQ edited edge metadata.Jul 22 2016, 1:51 AM

The natural way to avoid both heavy artillery with StmtNodes.inc and low-performance if-chains is to use a Visitor.

(Const)StmtVisitor is implemented as a huge switch, contents of which are auto-generated from StmtNodes.inc. In fact, your approach with StmtNodes.inc essentially re-implements the plain StmtVisitor. The official visitor, of course, also allows you to fall back to the parent class if the child visit method is missing.

So i believe it's worth a try to implement StmtDataCollector as an auxiliary ConstStmtVisitor. I thought the two visitors could be combined if a plain ConstStmtVisitor was used from the start, but even now it seems better to me to use an extra visitor than re-invent it(?)

No other comments so far, looks good :)

teemperor updated this revision to Diff 65086.Jul 22 2016, 8:32 AM
teemperor edited edge metadata.
  • StmtDataCollector is now using ConstStmtVisitor
  • Added tests for StmtDataCollector

Could we stress test the implementation by running on files from llvm's/clang's test suite. For example, for a test TST we could do something like: cat TST >> tmp_TST; cat TST >> tmp_TST and run our our copy-paste checker on tmp_TST. This way we know how many clones to expect and we can cross-check how many we are able to detect.

lib/Analysis/CloneDetection.cpp
98 ↗(On Diff #65086)

I'd prefer to see const Stmt *S. Could you please edit all occurrences?

109 ↗(On Diff #65086)

Use just llvm::StringRef, we don't need const &.

113 ↗(On Diff #65086)

Same here, const std::string &String. Is this caught by clang-format?

test/Analysis/copypaste/test-catch.cpp
29 ↗(On Diff #65086)

Could you add an example like:

void non_compound_stmt1(int x)
  try { x / 0; } catch(...) { x++; }

void non_compound_stmt2(int x, float z)
  try { z / 0; } catch(...) { x++; }
teemperor updated this object.Jul 25 2016, 5:03 PM
teemperor updated this revision to Diff 65854.Jul 27 2016, 6:59 PM
  • Rebased patch.
  • Fixed code style (const type instead of type const).
  • Fixed missing const on some variables in StmtDataCollector.
  • Rewrote a few comments/variable names.
teemperor updated this revision to Diff 65857.Jul 27 2016, 7:06 PM
teemperor marked 5 inline comments as done.
  • Removed duplicate test case.

Is there a specific situation/bug we want to test against with these tests? I looks to me as if they would mainly test against non-determinism (i.e. same statements have different data due to non-determinism).

lib/Analysis/CloneDetection.cpp
113 ↗(On Diff #65086)

It hasn't complained so far or I'm missing a flag. But it's also accepting many non-LLVM like naming schemes, so I assume it's quite forgiving in this regard.

teemperor updated this revision to Diff 65909.Jul 28 2016, 5:27 AM
  • Actually removed the duplicate test now.
NoQ added inline comments.Jul 28 2016, 9:53 AM
lib/Analysis/CloneDetection.cpp
91 ↗(On Diff #65909)

Maybe it's a good idea to introduce a typedef for unsigned here, because it'd be nice to be able to change it (eg., some day we may desire to switch to uint64_t).

119 ↗(On Diff #65909)

Maybe just allocate the right amount of bytes during resize? Some usual trick with rounding up, eg. (x + y - 1) / y.

134 ↗(On Diff #65909)

I noticed something: with this patch, you no longer consider the kind of the statement for supported statement types (even for, say, Expr), because the visitor doesn't automatically fall back to parent classes when a method for the child class is defined. Is this intentional? Or you may want to visit some statements with parent methods manually.

148 ↗(On Diff #65909)

I wonder how much we can save if we use a few pointers-based hashes instead of hashing long fully-qualified names, for stuff like function decls or types.

Uhm, by the way, in fact C++ overloaded functions have same qualified names, do we consider that? I guess sub-statement hashing would most likely take care of that.

test/Analysis/copypaste/test-asm.cpp
1 ↗(On Diff #65909)

Do we really need to prefix all test file names with "test-"? It's kind of obvious :)

teemperor marked an inline comment as done.Jul 28 2016, 5:40 PM
teemperor added inline comments.
lib/Analysis/CloneDetection.cpp
134 ↗(On Diff #65909)

Oh, that's indeed not how it should work! Each parent class should be visited. We could explicitly call the parent class for each visit method but I think having this automated would be good (i.e. like the old patches do it but without the resulting binary size increase). Would appreciate suggestions here :)

148 ↗(On Diff #65909)

I think we'll save a lot of space with that but it only works for a single TU. I think I'll change it until we actually have multi-TU support.

Yeah, the children types prevent overload collisions (or they should if the visitor wasn't broken).

teemperor updated this revision to Diff 66261.Jul 31 2016, 6:25 PM
  • Added typedef for the contents of the data vector.
  • Fixed that StmtDataCollector isn't visiting all parent classes of a statement.
  • Removed 'test-' prefix from test files.
  • Fixed the other problems as pointed out by Artem (Thanks!)
  • Added performance remark to the qualified name strings in the data vectors.

I'm now falling back to the ConstStmtVisitor for visiting the parent statements (see the DEF_ADD_DATA macro) which is probably the best fix for the visitor problem.

clang's binary size increases with this patch to 32890440 bytes from 32863664 bytes (+26 KB).

I've kept the long qualified name strings in the data vectors in for now because that's just a performance optimization. Using the pointer values would make the code non-deterministic and breaks cross-TU support and better solutions are probably out of scope for this patch. I've added a remark that we should fix this when we look into performance optimizations.

NoQ added a comment.Aug 1 2016, 1:32 AM

Congrats on the fantastic hack that nobody else noticed! I've a feeling we cannot add tests for that, because any test would quickly break if we change the hash, though demonstrating that we fixed the problem would still be cool (eg. take two unsupported expressions of different kind but same type, and show that they're no longer reported).

One more minor comment.

lib/Analysis/CloneDetection.cpp
127 ↗(On Diff #66261)

I think resize() zero-initializes the whole new data(?)

142 ↗(On Diff #66261)

This is... Beautiful... o________o
/Approves.

teemperor updated this revision to Diff 66308.Aug 1 2016, 5:59 AM
  • No longer re-initialize memory created from resize().
  • Added test for checking that StmtDataCollector visits all parent classes.
teemperor updated this revision to Diff 66309.Aug 1 2016, 6:08 AM
  • s/super class/base class/g
teemperor updated this object.Aug 1 2016, 8:12 AM
NoQ accepted this revision.Aug 1 2016, 9:42 AM
NoQ edited edge metadata.
v.g.vassilev accepted this revision.Aug 2 2016, 1:18 AM
v.g.vassilev edited edge metadata.
This revision is now accepted and ready to land.Aug 2 2016, 1:18 AM
This revision was automatically updated to reflect the committed changes.