This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Added custom hashing to the CloneDetector.
ClosedPublic

Authored by teemperor on Jul 19 2016, 7:34 AM.

Details

Summary

The CloneDetector consumes in it's current state to much memory to analyze big code bases such as SQLite.

This patch reintroduces the custom hashing functionality which replaces CloneSignature's large data vectors with fixed size hash codes. This includes an additional check for each clone that prevents that the CloneDetector produces false-positives because of collisions in the hash function.

Diff Detail

Repository
rL LLVM

Event Timeline

teemperor updated this revision to Diff 64497.Jul 19 2016, 7:34 AM
teemperor retitled this revision from to Added false-positive filter for unintended hash code collisions to the CloneDetector..
teemperor updated this object.
teemperor added reviewers: v.g.vassilev, zaks.anna, NoQ.
v.g.vassilev requested changes to this revision.Jul 20 2016, 5:39 AM
v.g.vassilev edited edge metadata.

Could you add some tests?

This revision now requires changes to proceed.Jul 20 2016, 5:39 AM
NoQ edited edge metadata.Jul 22 2016, 1:59 AM

Hmm. Since you're working with FoldingSetNodeIDs, maybe you could avoid numeric hashes completely in your code, and use FoldingSetNodeIDs themselves as your keys? Just organize the whole clone list as a map from FoldingSetNodeIDs to lists of particular code samples. There would be no collisions at all from the beginning (even if internally the map would hash the FoldingSetNodeIDs to a number and there will be collisions in this hashing, but you wouldn't need to know about them - and there's no performance loss at the same time).

@teemperor: could you summarize what we discussed at the phone meeting today wrt to this patch.

teemperor edited edge metadata.Jul 25 2016, 5:02 PM
teemperor added a subscriber: cfe-commits.

This patch is no longer needed because the CloneDetector from patch https://reviews.llvm.org/D20795 no longer uses custom hashing.

teemperor abandoned this revision.Jul 30 2016, 10:11 AM
teemperor updated this revision to Diff 67691.Aug 11 2016, 8:13 AM
teemperor edited edge metadata.
  • Rebased the patch to the newest code base.

This patch is live again as it turned out that the CloneDetector can't handle large code bases without it. The simpler approach that we implemented instead consumed too much memory (e.g. at least 15 GB for analyzing SQLite) because the large data vectors and the quadratic space complexity of the sub-sequence search didn't play together very well.

teemperor updated this revision to Diff 67715.Aug 11 2016, 11:29 AM
teemperor retitled this revision from Added false-positive filter for unintended hash code collisions to the CloneDetector. to [analyzer] Added custom hashing to the CloneDetector..
teemperor updated this object.
teemperor edited edge metadata.
  • Fixed some formatting problems.
  • Updated Title/Summary
NoQ added a comment.Aug 15 2016, 8:44 AM

Considering the refactoring planned in D23418, i've no more comments on the part of the code that isn't rewritten in the subsequent review. I like how the new hashing system is put to work here.

lib/Analysis/CloneDetection.cpp
565 ↗(On Diff #67715)

Hmm. FoldingSetNodeIDWrapper? Because in fact we're wrapping neither a set, nor a node, but only an ID.

576 ↗(On Diff #67715)

// end anonymous namespace?

teemperor marked 2 inline comments as done.Aug 16 2016, 6:13 AM
teemperor updated this revision to Diff 68170.Aug 16 2016, 6:15 AM
  • Renamed FoldingSetWrapper to FoldingSetNodeIDWrapper
  • Added missing // end anonymous namespace
teemperor planned changes to this revision.Aug 16 2016, 7:34 AM
  • As the hash_stream patch will take a while to land upstream, we will change this to use FoldingSetNodeID and change it to hash_stream once that landed.
v.g.vassilev accepted this revision.Aug 16 2016, 8:09 AM
v.g.vassilev edited edge metadata.

LGTM.

teemperor updated this revision to Diff 68632.Aug 18 2016, 4:15 PM
teemperor edited edge metadata.
  • Moved from hash_stream to LLVM's MD5 implementation.
This revision is now accepted and ready to land.Aug 18 2016, 4:15 PM
NoQ added inline comments.Aug 20 2016, 3:19 AM
lib/Analysis/CloneDetection.cpp
550 ↗(On Diff #68632)

Hmm, this still uses hash_stream.

teemperor updated this revision to Diff 68778.Aug 20 2016, 5:49 AM
  • Replaced hash_stream with MD5 in the sub-sequence code.
teemperor marked an inline comment as done.Aug 20 2016, 5:49 AM
This revision was automatically updated to reflect the committed changes.