Page MenuHomePhabricator

[clang-diff] Initial implementation.
ClosedPublic

Authored by johannes on Jun 18 2017, 9:08 AM.

Details

Summary

So far, I have implemented the gumtree algorithm https://github.com/GumTreeDiff/gumtree/ to match AST nodes in the source file with their equivalents in the destination files. It combines a heuristic top-down search with an optimal algorithm for small subtrees

The clang-diff tool will mimic the output of the gumtree textual diff, that is, it will print matched nodes (identified by their postorder offset) and an edit script consisting of insertions/deletions.
Note that every node is matched separately (even if large subtrees are equivalent), hence it is rather verbose.

clang-diff src.cpp dst.cpp

The trees that are used to create the matchings and the edit script can be serialized to JSON. I use this to test it by comparing the matchings and the edit script to the ones generated by a prototype implementation.

clang-diff -ast-dump src.cpp

Future features:

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
arphaman added inline comments.Jun 20 2017, 4:22 AM
test/Tooling/clang-diff-basic.cpp
4 ↗(On Diff #103163)

Why do you need two invocations of clang-diff with the same arguments?

johannes updated this revision to Diff 103208.Jun 20 2017, 8:13 AM
johannes marked 7 inline comments as done.
johannes marked an inline comment as done.Jun 20 2017, 8:20 AM

Looking at the output of the tool, I have the following suggestion:

  • We should avoid implicit expressions (We don't need to see things like Insert ImplicitCastExpr(21) into BinaryOperator: *(22) at 0). This can be done in a follow-up patch though.

Ok, I will include that and other features in future patches.

lib/Tooling/ASTDiff/ASTDiff.cpp
474 ↗(On Diff #103163)

I think they could be modified for minor improvements of the matching, but I am probably not going to do so anytime soon. Maybe it is better to store them at class scope as static const / constexpr?

693 ↗(On Diff #103163)

Not sure if I got this right.

test/Tooling/clang-diff-basic.cpp
3 ↗(On Diff #103163)

Yes, much better!

4 ↗(On Diff #103163)

That was unintentional, I removed it.

arphaman added inline comments.Jun 20 2017, 8:27 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
474 ↗(On Diff #103163)

Yeah, I suppose for now they should be declared as constants (the static const / constexpr doesn't matter). They should probably be documented as well.

693 ↗(On Diff #103163)

It looks better, thanks. Generally we prefer to use braces for if/for/etc. if they have more than one statement.

arphaman added inline comments.Jun 20 2017, 8:30 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
303 ↗(On Diff #103208)

What's the difference between SNodeId and NodeId?

arphaman added inline comments.Jun 20 2017, 8:35 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
171 ↗(On Diff #103208)

I believe that this method that you call label actually represents the value attribute that's described in the paper for the gumtree algorithm. Is that right? If so, then this method name should be updated to reflect that.

johannes added inline comments.Jun 20 2017, 8:38 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
303 ↗(On Diff #103208)

NodeId is the preorder offset inside the root tree, starting at 0.
SNodeId is the postorder offset within a subtree, starting at 1.
Not sure if this irregularity can be improved upon easily, but I guess I can mention it in the comments.

arphaman added inline comments.Jun 20 2017, 8:41 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
303 ↗(On Diff #103208)

Ok. It's fine to have two different types for them, but to avoid confusions I think that the code should be constructed in such a way that prohibits passing in NodeId to a function that expects SNodeId and vice-versa. You can wrap SNodeId in a typedef with an explicit constructor to achieve that goal, e.g.:

struct SNodeId {
  int Id;

  explicit SNodeId(int Id) : Id(Id) { };
};
johannes added inline comments.Jun 20 2017, 8:50 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
171 ↗(On Diff #103208)

Yes, good catch. Strangely, the gumtree implementation uses label. I think we should use type for node kinds and value for their actual value, in order to avoid confusion about what a label is.

johannes updated this revision to Diff 103320.Jun 20 2017, 10:54 PM
  • Fix a bug in getSimilarity()
  • Change terminology: label -> value
  • Define SNodeId: Now it cannot be implicitly constructed from an int, however it can be converted to int. Still feels a bit weird
  • Fix some issues with SNodeId
  • Rewrite the computation of leftmost descendants in subtrees.
arphaman added inline comments.Jun 21 2017, 12:28 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
156 ↗(On Diff #103320)

Please use a more descriptive name e.g. NumLeaves.

157 ↗(On Diff #103320)

you should be able to use auto instead of std::function here I think.

209 ↗(On Diff #103320)

Qualifiers (i.e. NestedNameSpecifiers) can contain multiple identifiers (e.g. foo::bar::). You should use the print method from NestedNameSpecifier instead.

226 ↗(On Diff #103320)

This Value will always be "", right? You can just return "" here then.

755 ↗(On Diff #103320)

Looking at the unpacking here it's probably better to make Change a struct with named fields instead of a tuple. You would still be able to use emplace to create the changes though.

759 ↗(On Diff #103320)

You can just get rid of the intermediate S here and print directly to OS in all cases, e.g.:

OS << "Delete " << T1.showNode(Id1) << "\n"

Reviewing this mainly from the API view, and leaving the technical details to others :)

include/clang/Tooling/ASTDiff/ASTDiff.h
124 ↗(On Diff #103320)

Generally, can we put the public interface first in the class?
Usually folks read a source file top-down, so if the most relevant pieces of code are at the top, the source file gets easier to read and understand.

Thus:

  1. try to keep clutter (like classes and typedefs not used by a user of the API) in a different header, or forward declare them and define them in the .cpp file
  2. sort things into public first, then private in classes.
johannes updated this revision to Diff 103409.Jun 21 2017, 9:35 AM
johannes marked 4 inline comments as done.
  • move some unnecessary things out of the public header

Is this a proper way to declutter the header file? Using inheritance would also be possible.

I have to define a destructor for ASTDiff because TreeComparator is forward-declared

lib/Tooling/ASTDiff/ASTDiff.cpp
157 ↗(On Diff #103320)

It does not work because the function is recursive. I'm not sure whether it is good practise to do it like this.

209 ↗(On Diff #103320)

Ok nice, this works properly!
This function needs a lot of work. I try to extract relevant information from base classes first.

The API looks better IMHO. Some more comments:

include/clang/Tooling/ASTDiff/ASTDiff.h
57 ↗(On Diff #103409)

I think that it's better to make make NodeId a structure as well and make InvalidNodeId a private member. I suggest the following interface for NodeId:

struct NodeId {
private:
  static const int InvalidNodeId; 
public:
  int Id;

  NodeId() : Id(InvalidNodeId) { }
  NodeId(int Id) : Id(Id) { }
  
  bool isValid() const { return Id != InvalidNodeId; }
  bool isInvalid() const { return Id == InvalidNodeId; }
};

This way you'll get rid of a couple of variable initializations that use InvalidNodeId. You also won't need to call the memset when creating the unique pointer array of NodeIds.

lib/Tooling/ASTDiff/ASTDiff.cpp
33 ↗(On Diff #103409)

Mapping should use the C++ move semantics IMHO.

35 ↗(On Diff #103409)

This field is used only in TreeComparator, so you might as well move it there and rename to something like IsMappingDone.

318 ↗(On Diff #103409)

We don't really need this check. Let's use just one unreachable here.

638 ↗(On Diff #103409)

I think this function would would be clearer and faster (you'll be able to avoid redundant work) if you use early exists for the main conditions of the return, e.g.:

if (M.hasSrc(Id1) || M.hasDst(Id2)) return false; // Both nodes must not be mapped.

And so on.

johannes updated this revision to Diff 103432.Jun 21 2017, 11:46 AM
johannes added inline comments.Jun 21 2017, 11:58 AM
include/clang/Tooling/ASTDiff/ASTDiff.h
57 ↗(On Diff #103409)

Ok, I did it like this.

Can I create a header file inside lib/Tooling/ASTDiff and include it from the public interface? This would help reduce the clutter.

Instead of NodeId we could also just use references / pointer types. I don't see any particularly good reason for choosing either one above the other.
I guess indices make it more obvious how to compute the number of descendants and such. On the other hand, when using reference types, there is less boilerplate to write for loops.

klimek added inline comments.Jun 22 2017, 1:31 AM
include/clang/Tooling/ASTDiff/ASTDiff.h
57 ↗(On Diff #103409)

No, but you can create a header ASTDiffInternal or somesuch next to the header in the public dir.

arphaman added inline comments.Jun 23 2017, 7:45 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
730 ↗(On Diff #103432)

Johannes, it seems to me that your implementation of the top-down portion of the GumTree algorithm doesn't use the an empty list A of candidate mappings that's described in the paper (and that you have in the Python prototype). Is that correct or am I missing something?

johannes added inline comments.Jun 23 2017, 9:56 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
730 ↗(On Diff #103432)

Yes, initially I implemented it as it is described in the paper, but then I realized that the list of candidate mappings will always stay empty, because the condition in the top-down algorithm in the paper on line 14 will never be true. Maybe I am mistaken here, but if t1 and t2 are isomorphic, then none of the descendants of t1 will be isomorphic to t2. I mean, the height of isomorphic trees must be equal, and the descendant does not have the same height. So to me this looks like an error in the paper, I probably should have communicated this.
What I did instead is, I had a look at the reference implementation, and what they do instead of using a list of candidate mappings is to just use a data structure for the mapping that allows multiple matches for each node.
After matching collecting all candidates this way, they extract the unambiguous matches and then sort the ambiguous matches by their parents' similarity.
AbstractSubtreeMatcher.java
GreedySubtreeMatcher.java
This seems to be a good solution, I plan to implement that in the future.

johannes added inline comments.Jun 26 2017, 3:57 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
730 ↗(On Diff #103432)

My bad, I misread the algorithm in the paper, of course the entire tree is searched for other isomorphic subtrees.
I will still stick to the way it is implemented in gumtree, it should be more efficient.

ruiu added a subscriber: ruiu.Jun 26 2017, 4:08 PM

I'd copy what Hal mentioned in other review thread for other GSoC project. You don't want to tag your patches with "[GSoC]" because it doesn't describe anything about patch contents and many other unrelated patches could have been tagged as single "[GSoC]" tag. Instead, you want to use some other tag that describe your patch. Looks like you are adding a new tool, so I'd guess that "[clang-diff]" could be a good tag.

johannes updated this revision to Diff 104124.Jun 27 2017, 3:27 AM
johannes retitled this revision from [GSoC] Clang AST diffing to [clang-diff] Initial implementation..

Just to clarify: I'm fine with adding the

include/clang/Tooling/ASTDiff/ASTDiff.h
32 ↗(On Diff #104124)

Can you pass-in the options by value instead of a pointer?

Ignore the "Just to clarify: I'm fine with adding the" comment, it was from last week that was saved in my session and that I didn't delete.

arphaman added inline comments.Jun 27 2017, 5:43 AM
include/clang/Tooling/ASTDiff/ASTDiff.h
57 ↗(On Diff #104124)

If you want to use two different names then something like SyntaxTreeImpl is better imho

johannes updated this revision to Diff 104249.Jun 27 2017, 1:25 PM
  • pass Options as a const reference instead of a pointer
  • rename TreeComparator -> ASTDiff::Impl, rename Comparator -> DiffImpl
  • move declaration of ASTDiff::Impl from the header to the source file so that Options does not need to be an opaque pointer
  • rename SyntaxTreeI -> SyntaxTreeImpl
johannes updated this revision to Diff 104394.Jun 28 2017, 6:00 AM
  • remove unused struct
  • rename getNodeValueI -> getNodeValueImpl
johannes updated this revision to Diff 104395.Jun 28 2017, 6:01 AM
johannes updated this revision to Diff 104401.Jun 28 2017, 6:10 AM
arphaman added inline comments.Jun 29 2017, 8:32 AM
include/clang/Tooling/ASTDiff/ASTDiffInternal.h
184 ↗(On Diff #104401)

getNodeValueImpl?

lib/Tooling/ASTDiff/ASTDiff.cpp
439 ↗(On Diff #104401)

Do you know the reason for the ZsMatcher name? I noticed that they used it in the GumTree java implementation as well. If it's based on some paper you should add a comment that mentions it.

johannes updated this revision to Diff 104664.Jun 29 2017, 9:27 AM
johannes marked 2 inline comments as done.Jun 29 2017, 9:36 AM
johannes added inline comments.
lib/Tooling/ASTDiff/ASTDiff.cpp
439 ↗(On Diff #104401)

Ok, i forgot about that, it's Zhang and Shasha's algorithm for the tree edit distance, basically computing the Levenshtein distance for ordered trees.

I think that it's pretty much ready. I think that the test should be expanded though. At the very least it should check that all of the node types that are supported by SyntaxTreeImpl::getNodeValueImpl get matched.

teemperor edited edge metadata.EditedJun 30 2017, 4:33 AM

I didn't have time to have a close look at this patch, but it seems you're interested in the specific TU-independent data of a Stmt to compare them. So if you are interested in such data and don't want to write your own function to collect it for each Stmt subclass, there is the StmtDataCollector in the CloneDetection.cpp here and a example how to use it is here.

I'm fine with moving this into a header and make it more usable for this use case if you think it makes sense to use it here. The main advantage would be that we don't get yet another of these classes in clang (we already have 3 of them: One in the Stmt profiler code, another one in the ODRHash code for modules and another one in the CloneDetection code).

johannes marked an inline comment as done.Jun 30 2017, 5:55 AM

I didn't have time to have a close look at this patch, but it seems you're interested in the specific TU-independent data of a Stmt to compare them. So if you are interested in such data and don't want to write your own function to collect it for each Stmt subclass, there is the StmtDataCollector in the CloneDetection.cpp here and a example how to use it is here.

I'm fine with moving this into a header and make it more usable for this use case if you think it makes sense to use it here. The main advantage would be that we don't get yet another of these classes in clang (we already have 3 of them: One in the Stmt profiler code, another one in the ODRHash code for modules and another one in the CloneDetection code).

Looking at this again now, it seems to make sense to consolidate this. It collects a lot of data I currently don't.
Does it include the values of literals / identifiers? It seems not (since it is geared towards detecting structural clones only). So we could make that optional, or I could do that myself.

Yes, it does indeed skip identifiers and literals for this reason :). It was planned to make this template more configurable for use cases like yours, so I'm totally fine with adding configuration parameters. I just opened D34880 where I make this template public as a first step.

johannes updated this revision to Diff 105059.Jul 3 2017, 6:27 AM
johannes edited the summary of this revision. (Show Details)

@johannes
Are you planning to work on integration with the StmtDataCollector in this patch or would you prefer to follow-up with additional patches?

@johannes
Are you planning to work on integration with the StmtDataCollector in this patch or would you prefer to follow-up with additional patches?

Later would be better

arphaman added inline comments.Jul 4 2017, 3:19 AM
lib/Tooling/ASTDiff/ASTDiff.cpp
385 ↗(On Diff #105059)

What's the purpose of the S prefix in the name of this method and a couple of other methods below? Can you replace it with something more descriptive or remove it altogether?

arphaman added inline comments.Jul 4 2017, 3:29 AM
include/clang/Tooling/ASTDiff/ASTDiffInternal.h
38 ↗(On Diff #105059)

NIT: You don't need this here or in -- as well.

lib/Tooling/ASTDiff/ASTDiff.cpp
359 ↗(On Diff #105059)

NIT: This ';' is redundant.

363 ↗(On Diff #105059)

NIT: You don't need 'this' here or in the 2 methods below.

809 ↗(On Diff #105059)

Missing assertion message.

johannes updated this revision to Diff 105161.Jul 4 2017, 5:44 AM
  • style fixes
  • correct getSimilarity()
johannes marked 5 inline comments as done.Jul 4 2017, 5:45 AM
This revision is now accepted and ready to land.Jul 4 2017, 5:47 AM

@johannes D34880 has landed, so feel free to propose patches to the StmtDataCollector API that would help you (e.g. to support identifiers). You can see examples how to use it in the CloneDetection.cpp (once for storing data in a FoldingetSetNodeID and once for directly hashing the data with MD5).

I'll commit this on behalf of Johannes today as he didn't get his access yet

This revision was automatically updated to reflect the committed changes.

@arphaman @johannes Is that normal that clang-diff isn't installed by cmake? (like clang-format?)

kadircet removed a subscriber: kadircet.Aug 26 2018, 8:05 AM

@arphaman @johannes Is that normal that clang-diff isn't installed by cmake? (like clang-format?)

Yes, we did not add that. I don't know if anyone would use it.