This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Added a reusable constraint system to the CloneDetector
ClosedPublic

Authored by teemperor on Aug 11 2016, 12:50 PM.

Details

Summary

A big part of the clone detection code is functionality for filtering clones and clone groups based on different criteria. So far this filtering process is hard coded into the
CloneDetector which makes it hard to understand and extend.

To fix this, this patch implements a system of reusable constraints that are used to filter clones. The system is build upon the Constraint class and its subclasses which each filter one type of clones. To actually filter a set of clones, a list of constraints needs to be created and passed to the CloneDetector which apply them one after another on a list of CloneGroups.

This patch also migrates all previously hard coded constraints to the new system.

Diff Detail

Repository
rL LLVM

Event Timeline

teemperor planned changes to this revision.Aug 11 2016, 12:50 PM
teemperor updated this revision to Diff 67730.
teemperor retitled this revision from to [analyzer] Added a pass architecture to the CloneDetector.
teemperor updated this object.
teemperor added reviewers: v.g.vassilev, NoQ, zaks.anna.
teemperor added a subscriber: cfe-commits.
  • Add description
  • Make sure functions in CloneDetection.cpp are in a more logical order.
teemperor updated this revision to Diff 68680.Aug 19 2016, 6:47 AM
teemperor retitled this revision from [analyzer] Added a pass architecture to the CloneDetector to [analyzer] Added a reusable constraint system to the CloneDetector.
teemperor updated this object.
  • Rebased patch.
  • Renamed passes to constraints.
v.g.vassilev added inline comments.Aug 20 2016, 9:45 AM
include/clang/Analysis/CloneDetection.h
235 ↗(On Diff #68680)

I am still unsure about the name. It is too generic. What about CloneConstraint?

NoQ added inline comments.Aug 22 2016, 9:00 AM
include/clang/Analysis/CloneDetection.h
243 ↗(On Diff #68680)

I might be wishing a lot, but i've a feeling this interface should be more obvious, i.e. understandable without reading the cpp code.

For example, what is the difference between acceptsHashGroup and acceptsGroup? I don't think the difference is about accepting different kinds of groups ("hash"-groups vs. "non-hash" groups). If two items X and Y are similar in some sense (and here we have methods with similar names and similar prototypes, and the reader would instantly notice that), it's a good trick to first confirm that these are similar, and then highlight the difference: "However, unlike X, the Y is...".

Another example - are these accept... functions supposed to be "pure" (just have a look at the sequence or a group and say if we accept it), or are they supposed to change state of the constraint object? I wish it was obvious from just looking at the interface.

If i wished for even more, i wish this interface could consist of only one method. It seems that this interface is complicated because of performance considerations. So i'd probably want to know more about these considerations, how exactly do all the methods fit together and what is the efficient way of using them.

In order to solve this problem, probably the comment before the class could be extended to explain the big picture. I'd also do my best to understand things better, and probably come up with a wording that would look better from a bystander point of view. Here are a few examples of what seems easier to understand:

  • "/// First, CloneDetector filters out groups that are obviously out of place... Then, it takes one prototype from each group and repeatedly asks the Constraint object if other clones of the group are similar to the prototype..."
  • "/// If it is obvious just by looking at this group that all sequences in it shouldn't ever be reported, override this callback to notify CloneDetector about that and save some performance..."
  • "/// In this callback you may record the property of the prototype you're interested in, so that later you could quickly compare other sequences to it..."
NoQ edited edge metadata.EditedAug 26 2016, 7:32 AM

Here's some pseudo-code of the way i see it.

// This interface mimics CloneDetector's interface, hence omnipotent but useless.

class BasicConstraint {
public:
  virtual void add(const StmtSequence &S) = 0;
  virtual vector<CloneGroup> findClones() = 0;
};
// This constraint separates statements by their hash values.
// Useful for the first pass, when we can't afford the number of sequences
// to slow us down.

class HashingConstraint: public BasicConstraint {
  map<hash_t, CloneGroup> M;

public:
  virtual hash_t hash(const StmtSequence &S) = 0;

  virtual void add(const StmtSequence &S) override {
    M[hash(S)].append(S);
  }
  virtual vector<CloneGroup> findClones() override {
    vector<CloneGroup> V;
    for (I in M)
      V.append(I.second);
    return V;
  }
};
// This interface does pairwise comparisons via the provided compare() function.
// Quadratic but easy to use for later passes.

class ComparingConstraint {
  vector<CloneGroup> V;

public:
  virtual void compare(const StmtSequence &LHS, const StmtSequence &RHS) = 0;

  virtual void add(const StmtSequence &S) override {
    for (auto G in V) {
      if (compare(G[0], S))
        G.append(S);
      else
        V[V.length()].append(S);
    }
  }
  vector<CloneGroup> findClones() override {
    return V;
  }
};

And inherit custom constraints from these building blocks, probably provide more building blocks. Feel free to add const/static stuff and replace virtuals with recursive templates :)

(Quick stylistic comments)

include/clang/Analysis/CloneDetection.h
224 ↗(On Diff #68680)

Use of \brief is deprecated (we enabled auto brief quite some time ago).

417 ↗(On Diff #68680)

= default

419 ↗(On Diff #68680)

No need for virtual on override.

teemperor updated this revision to Diff 77759.Nov 13 2016, 3:10 PM
teemperor updated this object.
teemperor edited edge metadata.
  • Rebased patch to the current trunk state.
  • Replaced runtime polymorphism with templates.
  • Constraint interface now only has one method signature.
teemperor planned changes to this revision.Nov 13 2016, 3:12 PM
  • I ran all real-world tests (sqlite, etc.) before rebasing to trunk. I'm not 100% confident that I correctly merged everything, so I'll rerun them just in case. The normal clang test-suite passes, so it looks good.
teemperor updated this revision to Diff 89084.Feb 20 2017, 1:04 AM
  • Removed all the deprecated \briefs

I couldn't find any actual regression in the code now, so from my side it's ok to merge it.

NoQ added a comment.Mar 7 2017, 7:22 AM

I'm sorry for the delays; i'm humbly requesting a couple more weeks to get back to this awesome stuff.

v.g.vassilev edited edge metadata.Mar 7 2017, 9:52 AM

We were planning adding some extra unittests, too...

NoQ added a comment.Mar 20 2017, 6:57 AM

I think this is awesome.

Would anybody like to move concrete constraint class definitions to a separate header + translation unit, so that to highlight how simple the primary interface is to a person who haven't gone through all the discussion?

include/clang/Analysis/CloneDetection.h
172 ↗(On Diff #89084)

Strange spacing.

178 ↗(On Diff #89084)

Should this pair of functions be made static?

218–219 ↗(On Diff #89084)

While you could enforce this with static_assert<is_base_of<CloneConstraint, T>> in the detector, i think we shouldn't be strictly enforcing this.

Providing useful utility functions is pretty much the only purpose of this class, so it intends to be useful, but this shouldn't block users from implementing the constrain() method completely from scratch.

Also, we could probably move the method that groups clones by hashes here as the third utility method, if more than one hash function is eventually used(?)

318 ↗(On Diff #89084)

This comment looks incomplete.

lib/Analysis/CloneDetection.cpp
308 ↗(On Diff #89084)

I remember some buildbots complained about conflicting names between variables and classes. Could we rename the Stmt parameter? Something like Seq is fine with me.

335–336 ↗(On Diff #89084)

In particular, this method doesn't use any CloneConstraint's utility methods, as far as i understand.

lib/StaticAnalyzer/Checkers/CloneChecker.cpp
81–83 ↗(On Diff #89084)

Yay. I like how it looks.

teemperor marked 5 inline comments as done.Mar 25 2017, 3:46 AM
teemperor added subscribers: xiangzhai, liangqi.
v.g.vassilev added inline comments.Mar 25 2017, 4:39 AM
lib/StaticAnalyzer/Checkers/CloneChecker.cpp
147 ↗(On Diff #89084)

Could you transform the TODO into a FIXME?

172 ↗(On Diff #89084)

Could you add a FIXME here?

Hi Raphael,

Thanks for your awesome job!

I have read about the slide of "LLVM: built-in scalable code clone detection based on semantic analysis", and sent email to Shamil for asking "Is Code clone detection in LLVM compiler infrastructure available?" he replied me that the implementation (in LLVM IR layer?) is unavailable - close source. so I pay more attention to your implementation without LLVM IR, but I am really really really care about:

  1. Performance. I use scan-build -k -v -enable-checker alpha.clone.CloneChecker to static analysis K3B, it almost spend one hour! optimization plan?
  2. False Positive. I use Coverity scan K3B to compare with CloneDetector, there is NO false positive when detecting Copy-paste issue by setting pattern to ignore Qt Meta-Object Compiler and other auto-generated source files. perhaps CloneChecker is able to learn this way to low false positive?

Thanks again for your great job!

Regards,
Leslie Zhai

teemperor updated this revision to Diff 93180.Mar 27 2017, 2:59 PM

Thanks for the review Artem!

Changes:

  • No longer including the old LLVM hashing header.
  • Fixed the messed up comment formatting when i removed all the \briefs...
  • CloneConstraint is now just a collection of static methods and its no longer necessary to inherit from it. (Maybe we can just replace it in the future with a namespace in the future? I don't like free floating functions and they don't fit into the CloneDetector interface, so this felt like the least-ugly way to do it.)
  • Renamed Stmt to Seq to prevent that one warning Artem mentioned.
  • Added a example unittest that shows a bit how to use the API and how to create your own Constraint.
  • StmtSequence now uses Decl* instead of ASTContext. We get the ASTContext over the stored Decl, but with a Decl we can do better heuristics for StmtSequence::contains and we can do fast filtering based on function name etc.
teemperor marked 7 inline comments as done.Mar 27 2017, 3:16 PM

Hey Leslie,

regarding performance: Last time I checked we spend most of the time on the verification of the hash values. We can do some tricks to make this faster (like delaying the verification to the end of the constraints where we have much less clones and therefore less overhead with this)

regarding false-positives: We will do some basic stuff to reduce them soon-ish (e.g. increasing the minimum clone size default value, filtering those generated files). Afterwards we probably need some fancier way of teaching the checker what is a valid clone and what not.

If you have some actual bugs that are found by the checker, please send them to me (teemperor [AT] gmail.com) that we can add them to the regression tests :)

include/clang/Analysis/CloneDetection.h
218–219 ↗(On Diff #89084)

Everything done here beside the hashing, because I think we need to do a few changes there in the future that need some more discussion. For example:

  • Split up hashing and verifying them into different constraints. This could drastically improve performance because verification of hashes id currently the biggest time-consumer on an average code-base (at least that was the case the last time I profiled :) ). Depending on how we do this we might need to rely to a certain degree on the unique-ness property of the hash codes...
  • Unify the hashing we do with the hashing that we use in the AST. I think we should wait here until we are sure that this is the right thing to do and that the code is mature enough.
lib/StaticAnalyzer/Checkers/CloneChecker.cpp
81–83 ↗(On Diff #89084)

\o/

Hi Raphael,

Thanks for your reply!

regarding performance: Last time I checked we spend most of the time on the verification of the hash values. We can do some tricks to make this faster (like delaying the verification to the end of the constraints where we have much less clones and therefore less overhead with this)

regarding false-positives: We will do some basic stuff to reduce them soon-ish (e.g. increasing the minimum clone size default value, filtering those generated files). Afterwards we probably need some fancier way of teaching the checker what is a valid clone and what not.

If you have some actual bugs that are found by the checker, please send them to me (teemperor [AT] gmail.com) that we can add them to the regression tests :)

I will test your updated patch tomorrow by testing with KDE/GNOME real projects.

Regards,
Leslie Zhai

teemperor marked an inline comment as done.Mar 28 2017, 2:55 AM

Well this patch won't change a lot with the false-positives or performance (it's more refactoring) :)

teemperor updated this revision to Diff 93270.Mar 28 2017, 12:29 PM
  • Remove the assert(ChildHash) which is obviously wrong code that just rarely fails.
teemperor updated this revision to Diff 93302.Mar 28 2017, 2:32 PM
  • auto to const auto to be more consistent with LLVM style.
v.g.vassilev accepted this revision.Mar 28 2017, 2:47 PM

I am very happy to see this converging! Thanks for the work and perseverance! LGTM.

This revision is now accepted and ready to land.Mar 28 2017, 2:47 PM
xiangzhai added a comment.EditedMar 28 2017, 6:21 PM

Hi Raphael,

Congratulate the patch was accepted!

Well this patch won't change a lot with the false-positives or performance (it's more refactoring) :)

So may I Teach CloneDetection about Qt Meta-Object Compiler and filter other auto-generated source files base on your patch? thanks a lot!

Regards,
Leslie Zhai

teemperor updated this revision to Diff 93503.Mar 30 2017, 11:02 AM
  • Remove the mysterious unicode-character at the start of CloneDetection.cpp
  • Fixed formatting of the comment in CloneDetectionTest.cpp
  • Fixed comment in StmtSequence::contains that was still talking about checking for translation units even though we compared declarations.

I think this is everything from my side for now :)

Hi Raphael,

Then I will rebase my patch D31320 please review it when you have free time, thanks a lot!

Regards,
Leslie Zhai

NoQ accepted this revision.Apr 4 2017, 8:41 AM

Yay, i think this is good to go. Sorry again for being slow>< I guess i'd go ahead and land this patch soon.

include/clang/Analysis/CloneDetection.h
228 ↗(On Diff #93503)

All right, now this is essentially a namespace :)

Change it to a namespace when merging if you prefer that :). Thanks for the review!

This revision was automatically updated to reflect the committed changes.
NoQ reopened this revision.Apr 5 2017, 8:22 AM

Hmm, reverted because i'm seeing crashes on some buildbots (works for me though).

It's crashing somewhere in saveHash, seems that some Stmts are null. For instance, http://lab.llvm.org:8011/builders/clang-cmake-aarch64-42vma/builds/5998 :

#0 0x00000000015356f8 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x15356f8)
#1 0x0000000001533a58 llvm::sys::RunSignalHandlers() (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x1533a58)
#2 0x0000000001533d0c SignalHandler(int) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x1533d0c)
#3 0x000003ffb5150510 (linux-vdso.so.1+0x510)
#4 0x0000000002b562f0 clang::Stmt::children() (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x2b562f0)
#5 0x00000000029772a4 clang::RecursiveCloneTypeIIConstraint::saveHash(clang::Stmt const*, clang::Decl const*, std::vector<std::pair<unsigned long, clang::StmtSequence>, std::allocator<std::pair<unsigned long, clang::StmtSequence> > >&) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x29772a4)
#6 0x00000000029795a8 clang::RecursiveCloneTypeIIConstraint::constrain(std::vector<llvm::SmallVector<clang::StmtSequence, 8u>, std::allocator<llvm::SmallVector<clang::StmtSequence, 8u> > >&) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x29795a8)
#7 0x00000000026786b0 void clang::CloneDetector::findClones<clang::RecursiveCloneTypeIIConstraint, clang::MinComplexityConstraint, clang::MinGroupSizeConstraint, clang::OnlyLargestCloneConstraint>(std::vector<llvm::SmallVector<clang::StmtSequence, 8u>, std::allocator<llvm::SmallVector<clang::StmtSequence, 8u> > >&, clang::RecursiveCloneTypeIIConstraint, clang::MinComplexityConstraint, clang::MinGroupSizeConstraint, clang::OnlyLargestCloneConstraint) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x26786b0)
#8 0x0000000002678b08 void clang::ento::check::EndOfTranslationUnit::_checkEndOfTranslationUnit<(anonymous namespace)::CloneChecker>(void*, clang::TranslationUnitDecl const*, clang::ento::AnalysisManager&, clang::ento::BugReporter&) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x2678b08)
#9 0x00000000028a9394 clang::ento::CheckerManager::runCheckersOnEndOfTranslationUnit(clang::TranslationUnitDecl const*, clang::ento::AnalysisManager&, clang::ento::BugReporter&) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x28a9394)
#10 0x000000000204b62c (anonymous namespace)::AnalysisConsumer::HandleTranslationUnit(clang::ASTContext&) [clone .part.4452] [clone .constprop.4496] (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x204b62c)
#11 0x000000000206de78 clang::ParseAST(clang::Sema&, bool, bool) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x206de78)
#12 0x00000000019e96b4 clang::FrontendAction::Execute() (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x19e96b4)
#13 0x00000000019c2c70 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x19c2c70)
#14 0x0000000001a64f30 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x1a64f30)
#15 0x0000000000998420 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x998420)
#16 0x0000000000963b94 main (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x963b94)
#17 0x000003ffb4bef68c __libc_start_main (/lib64/libc.so.6+0x1f68c)
#18 0x00000000009950e8 _start (/home/buildslave/buildslave/clang-cmake-aarch64-42vma/stage1/./bin/clang+0x9950e8)
This revision is now accepted and ready to land.Apr 5 2017, 8:22 AM
spatel added a subscriber: spatel.Apr 5 2017, 9:07 AM
In D23418#719139, @NoQ wrote:

Hmm, reverted because i'm seeing crashes on some buildbots (works for me though).

It's crashing somewhere in saveHash, seems that some Stmts are null. For instance, http://lab.llvm.org:8011/builders/clang-cmake-aarch64-42vma/builds/5998 :

Not sure if it helps, but there was this build warning:
/home/buildslave/buildslave/clang-cmake-aarch64-39vma/llvm/tools/clang/lib/Analysis/CloneDetection.cpp:395:36: warning: ‘S’ is used uninitialized in this function [-Wuninitialized]

teemperor updated this revision to Diff 94337.Apr 6 2017, 2:44 AM
  • Renamed variable from 'S' to 'Child' to make the buildbots happy (and because it's more expressive)
  • Fixed the name of the unit test.
teemperor added inline comments.Apr 6 2017, 2:47 AM
lib/Analysis/CloneDetection.cpp
395 ↗(On Diff #94337)

I couldn't reproduce, but from what I assume form the warning and the crash that we confused the compiler here by naming both the loop-variable and the parameter 'S'. I renamed it to 'Child' which /should/ fix it.

NoQ added a comment.Apr 6 2017, 3:56 AM
$ cat test.c
#include <stdio.h>
int main() {
  int i = 5;
  {
    int i = i;
    printf("%d\n", i);
  }
  return 0;
}
$ clang test.c
$ ./a.out
32767
$ clang test.c -Xclang -ast-dump
...
`-FunctionDecl 0x7ff82d8eac78 <test.c:2:1, line:9:1> line:2:5 main 'int ()'
  `-CompoundStmt 0x7ff82d8eb070 <col:12, line:9:1>
    |-DeclStmt 0x7ff82d8eadb0 <line:3:3, col:12>
    | `-VarDecl 0x7ff82d8ead30 <col:3, col:11> col:7 i 'int' cinit
    |   `-IntegerLiteral 0x7ff82d8ead90 <col:11> 'int' 5
    |-CompoundStmt 0x7ff82d8eb010 <line:4:3, line:7:3>
    | |-DeclStmt 0x7ff82d8eae78 <line:5:5, col:14>
    | | `-VarDecl 0x7ff82d8eadd8 <col:5, col:13> col:9 used i 'int' cinit
    | |   `-ImplicitCastExpr 0x7ff82d8eae60 <col:13> 'int' <LValueToRValue>
    | |     `-DeclRefExpr 0x7ff82d8eae38 <col:13> 'int' lvalue Var 0x7ff82d8eadd
...

It seems that in int i = i; the i in the new i, which is already declared by now, just not initialized yet, rather than the old i.

I think for range loops work differently:

#include <cassert>
#include <vector>
#include <iostream>

struct Foo {
int* begin() const { assert(false); }
int* end() const { assert(false); }
void bar() const { std::cout << "Different behavior" << std::endl; }
};

int main() {

std::vector<Foo*> F;
F.resize(1);
for (const Foo* F : F) {
    F->bar();
}

}

teemperor@ftlserver ~/test> clang++ test.cpp -std=c++11 ; and ./a.out
Different behavior

This revision was automatically updated to reflect the committed changes.