This is an archive of the discontinued LLVM Phabricator instance.

Added Fixer implementation and fix() interface in clang-format for removing redundant code.
ClosedPublic

Authored by ioeric on Mar 29 2016, 7:22 AM.

Details

Summary

After applying replacements, redundant code like extra commas or empty namespaces
might be introduced. Fixer can detect and remove any redundant code introduced by replacements.
The current implementation only handles redundant commas.

Diff Detail

Repository
rL LLVM

Event Timeline

ioeric updated this revision to Diff 51905.Mar 29 2016, 7:22 AM
ioeric retitled this revision from to Added Fixer implementation and fix() interface in clang-format for removing redundant code..
ioeric updated this object.
ioeric added a reviewer: djasper.
ioeric added a subscriber: cfe-commits.
ioeric added inline comments.Mar 31 2016, 2:05 AM
include/clang/Format/Format.h
813 ↗(On Diff #51905)

I am adding reformatting after fixing, and I am wondering if we can get rid of the Style parameter here (since the style does not really matter in code fixing, I think?) and let the reformat handling the style with the use case like:

FinalReplaces = formatReplacements(Code, fixReplacements(Code, Replaces), Style);

where fixReplacements is a function parallel to formatReplacements.

ioeric updated this revision to Diff 52199.Mar 31 2016, 6:41 AM
  • Added fixReplacements() that fix and reformat replacements.
ioeric updated this revision to Diff 52547.Apr 4 2016, 6:26 AM
  • Added fix for empty namespace.
  • Merge multiple continuous token deletions into one big replacement; minor code styling.
  • refactored code to reduce redundancy.
djasper added inline comments.Apr 4 2016, 12:37 PM
include/clang/Format/Format.h
770 ↗(On Diff #52547)

I think we should not automatically format the replacements after fixing. This function can easily be combined with formatReplacements if that is desired.

802 ↗(On Diff #52547)

This sentence doesn't make sense.

803 ↗(On Diff #52547)

I believe the style will become important as certain aspects of fixing are going to be controlled via the style. E.g. some styles might want to turn:

if (a)
  return;

into:

if (a) {
  return;
}
lib/Format/Format.cpp
1446 ↗(On Diff #52547)

I'd probably move all of these out into a class so you don't need to pass the same set of parameters between all of them and so that you can make those private that aren't meant to be called directly.

1596 ↗(On Diff #52547)

I am not sure, this is the right class to pull out. It still has a lot of overlap with formatter. Maybe it is instead a better idea to pull out each fixer into its own class and pull them into the formatter.

1651 ↗(On Diff #52547)

The Tokens parameter is unused!?

1654 ↗(On Diff #52547)

I am not (yet) convinced that the fundamental design here makes sense. So far it seems to me that all fixers will either operate within a single line (ctor fixer and probably most others) or that they do something specific on multiple lines (namespace fixer and empty public/private section fixer and maybe others). Creating a new way to iterate over all tokens of all lines doesn't really add useful functionality here AFAICT.

1657 ↗(On Diff #52547)

No braces. Here and at all other single-statement ifs.

1690 ↗(On Diff #52547)

Having a function called "reset" that is only called once is weird. Just do this in the constructor?

1748 ↗(On Diff #52547)

Comment on what this is doing.

1749 ↗(On Diff #52547)

Ed? Either use E or End.

1891 ↗(On Diff #52547)

Why is this needed?

1909 ↗(On Diff #52547)

A lot of these need documentation. What's the current line? What is the last token (last token that was read? last token of line? last token of file?). What are RedundantTokens?

2356 ↗(On Diff #52547)

Too much code duplication. Find a way to avoid that (passing in the calls to either reformat() or fix() as lambdas to a shared internal function).

unittests/Format/FormatTest.cpp
11236 ↗(On Diff #52547)

All of this should go into a different file.

ioeric updated this revision to Diff 52676.Apr 5 2016, 5:25 AM
ioeric marked 14 inline comments as done.Apr 5 2016, 5:25 AM
  • Refactored the code to reduce code duplication. Code styling. Moved test cases for Fixer into a new file FixTest.cpp.
  • Added RangeManager to manage affected ranges. make empty namespace fixer delete tokens not in an affected line.
lib/Format/Format.cpp
1596 ↗(On Diff #52547)

But I'm not quite sure if implementing the fixer in formatter works better since they look like two parallel classes to me. And it might make the formatter more complicated to use if we mix it with fixer. I think adding another layer of abstraction might be one way to reduce duplication?

1654 ↗(On Diff #52547)

I've also considered the way formatter does, which iterates over lines. But it seems to be easier to write fixer, especially multi-line fixer, if we can just ignore lines. Working with lines seems to add another layer of complexity here?

djasper added inline comments.Apr 5 2016, 10:20 PM
lib/Format/Format.cpp
1654 ↗(On Diff #52676)

That I don't understand. Almost all fixers (cleaning up commas, constructor initializers, etc.) will only ever look at a single line.

The matchers that do make multiline fixes (empty namespaces, empty public/private sections) only look at complete lines, i.e. only ever use AnnotatedLine::startsWith(...) and don't iterate over the tokens, AFAICT.

I really think that giving the option to iterate over all tokens of multiple lines does more harm than good. Among several other things, error recovery is made harder. It never makes sense for the ctor-initializer fixer to leave its line (or else you'll just ignore some nice error recovery already implemented in the UnwrappedLineParser).

1693 ↗(On Diff #52676)

I have a comment here (and in some other places), but I think this will be different anyway if we move to line-based fixers. So I am holding off with those comments for now.

1718 ↗(On Diff #52676)

Just go to CurrentToken->MatchingParen!?

1824 ↗(On Diff #52676)

I am happy to show you how the implementation here gets much simpler if you only iterate over AnnotatedLines.

1916 ↗(On Diff #52676)

I think "redundant" is the wrong word here. How about DeletedTokens?

ioeric updated this revision to Diff 52832.Apr 6 2016, 12:10 PM
ioeric marked 4 inline comments as done.
  • Change implementation of fixer to iterate by lines. TODO: refactor Formatter and Fixer to reduce duplication.
djasper added inline comments.Apr 8 2016, 2:09 PM
include/clang/Format/Format.h
769 ↗(On Diff #52832)

I am actually not sure that fixReplacements is the right terminology here (sorry that I haven't discovered this earlier). It isn't really fixing the replacements (they aren't necessarily broken). How about cleanupAroundReplacements? Manuel, what do you think?

lib/Format/Format.cpp
1596 ↗(On Diff #52832)

But like this, it is way to much code duplication for me, both in number of lines of code as well as code complexity. Basically everything in fix() is duplicated along with most of the class members etc. The main difference seems to be that runFix() is called instead of format. There must be a way to have only one implementation of that code.

I think (but cannot see all the consequences yet) you should merge the two classes. And if that leads to too much complexity, you can probably pull out runFix and everything it calls into a separate class as well as possibly format() and everything that format calls.

1597 ↗(On Diff #52832)

Why are you doing this?

1664 ↗(On Diff #52832)

Can you move the ConstructorInitializer fixer and everything that belongs to it to a different patch? I want to look at that in more detail and two smaller patches make this easier for me.

1672 ↗(On Diff #52832)

No need for this boolean. Write this as:

for (FormatToken *Tok = Line.First;; Tok = Tok->Next) {
  if (!Tok->is(tok::comment))
    return false;
}
return true;

And maybe we should (in a separate patch) add an iterator interface to AnnotatedLine so that we can use range-based for loops.

1695 ↗(On Diff #52832)

Why not return -1 here?

1699 ↗(On Diff #52832)

Move the startsWith into containsOnlyComments.

1701 ↗(On Diff #52832)

I'd write this in a very different order:

if (AnnotatedLines[CurrentLine]->startsWith(tok::r_brace))
  break;
if (AnnotatedLines[CurrentLine]->startsWith(kw_namespaces) || ..) {
  int NewLine = checkEmptyNamespace(CurrentLine);
  if (NewLine == -1)
    return -1;
  continue;
if (!containsOnlyComments(..))
  return -1;
1702 ↗(On Diff #52832)

It's a bit subtle that this works correctly. At least I needed to think quickly why you wouldn't falsely run into this in

namespace N {
void f() {
}
}

But of course, then the namespace wouldn't be empty. Please leave a comment explaining this briefly.

1717 ↗(On Diff #52832)

Hm, this seems very inelegant. If you have:

namespace A {
namespace B {
namespace C {
}
}
}

"namespace C {" and "}" gets delete three times. At the very least you should add

if (AnnotatedLines[i]->Deleted)
  continue;

Thinking about this some more, it will get deleted even more often has even the outer loop in runFix will call checkEmptyNamespace again for the inner namespaces even though they have been deleted already. So inner namespaces will bei deleted O(N^2) times if N is the nesting depth. Lets get that down to 1 instead ;-)

1818 ↗(On Diff #52832)

Can you explain why you are doing this?

2286 ↗(On Diff #52832)

Don't define a typedef for something that is only used once. Also, this is an internal function, how about writing this as:

template <typename T>
static tooling::Replacements
processReplacements(StringRef Code, const tooling::Replacements &Replaces, T ProcessFunc,
                    const FormatStyle &Style) {

}

No need to spell out the function type (I think).

ioeric updated this revision to Diff 53185.Apr 10 2016, 3:24 PM
ioeric marked 10 inline comments as done.
  • Moved constructor initializer fixer to a separate patch; pull runFixer and runFormat into separate classes, and merge common code in CodeProcessor class.
ioeric updated this revision to Diff 53202.Apr 11 2016, 1:06 AM
  • Fixed a potential bug in checkEmptyNamespace.
ioeric added inline comments.Apr 11 2016, 1:09 AM
lib/Format/Format.cpp
1590 ↗(On Diff #53202)

I'm not quite sure if the name is accurate. Do you have a better name?

1597 ↗(On Diff #52832)

I am not quite sure here actually. This was copied from Formatter. And since consumeUnwrappedLine requires at lease one element in UnwrappedLines, I figured this is necessary initialization?

1818 ↗(On Diff #52832)

I want to reduce the number of replacements so that when we do reformat on the fixed code, there could be fewer changed code ranges considering that the current implementation of computeAffectedLines is not that efficient when we have many ranges.

2286 ↗(On Diff #52832)

Wow, this is really helpful! Thanks!

djasper added inline comments.Apr 11 2016, 4:57 AM
include/clang/Format/Format.h
800 ↗(On Diff #53202)

Same as above, "fix" is probably not the right word. "cleanup" seems somewhat better. Here and everywhere else in this patch :-/.

lib/Format/Format.cpp
1446 ↗(On Diff #53202)

Maybe call this AffectedRangeManager? That is the class's very specific purpose.

1590 ↗(On Diff #53202)

I think the difficulty of finding a name somewhat stems from the fact that it does multiple things at once. I think the main purpose is to encapsulate the "flattening" of the different #if/#else branches in successive runs.

Manuel, do you have an idea for a good name here?

1607 ↗(On Diff #53202)

It feels like these two functions increase the coupling between Formatter/Fixer and CodeProcesser and aren't really necessary. Can you just inline them at the locations where they are called?

1669 ↗(On Diff #53202)

Lets not have this access. Provide getters.

1908 ↗(On Diff #53202)

I see why you are doing it this way, but let me propose something else:

How about iterating over all of the lines for each thing to cleanup (namespaces, ctor-intializers, etc.). I don't think it makes a significant difference in the runtime. Then you can just skip all the lines that you have already looked at (i.e. the returned int).

Among other things, you can probably remove the duplication of the AnnotatedLines[CurrentLine]->startsWith(..) and maybe even the AnnotatedLine::Deleted field.

lib/Format/FormatToken.h
283 ↗(On Diff #53202)

Do you need this? Wouldn't it be enough to add them to the set of deleted tokens and mark them as Finalized?

ioeric updated this revision to Diff 53268.Apr 11 2016, 10:03 AM
ioeric marked 16 inline comments as done.
  • Addressed comments.
ioeric updated this revision to Diff 53274.Apr 11 2016, 10:26 AM
  • removed unused fields in AnnotatedLine.
ioeric updated this revision to Diff 53370.Apr 12 2016, 12:51 AM
  • minor changes in checkEmptyNamespace.
djasper added inline comments.Apr 12 2016, 10:33 PM
lib/Format/Format.cpp
1677 ↗(On Diff #53370)

I think FormatStyle in the Processor should be const and Formatter should make its own copy to modify it.

1926 ↗(On Diff #53370)

Why are you introducing a return value that you aren't using? I'd just return the number of subsequent lines to skip.

Overall, I'd do this a bit differently:

  • Let it check the namespace starting at CurrentLine
  • Delete empty namespaces within
  • If it is empty, delete the outer namespace
  • Always return how many lines it has looked at

That way, you don't need to store DeletedLines and you don't ever look at the same line multiple times.

1927 ↗(On Diff #53370)

What about

namespace A { // really cool namespace

?

1987 ↗(On Diff #53370)

I believe that this doesn't work in many circumstances. Not sure I can easily construct a test case for the namespace cleanup, but generally, AnnotatedLines are not in a strict order. The last token of a line isn't necessarily next to the first token of the next line. This happens, e.g. for nested blocks and if preprocessor lines are intermingled with other lines.

1990–1993 ↗(On Diff #53370)

I think this should hardly matter considering that we are probably only going to have a few instances of cleanup replacements and the range calculation is not that inefficient.

2224 ↗(On Diff #53370)

Do we also need that if we spell out the ProcessFunc above? Maybe that's actually less painful compared to writing these lambdas?

ioeric updated this revision to Diff 53866.Apr 15 2016, 3:01 AM
ioeric marked 6 inline comments as done.
  • Do not merge multiple lines when generate fixes. Added test cases with comments around namespace.
  • Refactor: pull Annotator into Formatter/Cleaner from CodeProcessor. Moved AffectedRangeManager to a separate file.
klimek added inline comments.Apr 18 2016, 8:28 AM
include/clang/Format/Format.h
769 ↗(On Diff #53866)

cleanupAroundReplacements sounds good.

lib/Format/Format.cpp
1466 ↗(On Diff #53866)

Nit: I think it's idiomatic to call this a "Callback".

2143 ↗(On Diff #53866)

This is unexpected: I'd not have expected the CodeProcessor to be used to keep state apart from what's used during a Process() run. I think we'll need a different name.

2176 ↗(On Diff #53866)

The way the Processor is passed into the callbacks so they can call stuff on the processor again makes me think this is really tightly coupled and should use inheritance.

Alternatively, we might want to try to break up the CodeProcessor into a part that just runs the callback and keeps the state for running the callback around, and an interface that is used by the callbacks - I'll need to look more closely into this though.

klimek added inline comments.Apr 18 2016, 8:37 AM
lib/Format/Format.cpp
1450 ↗(On Diff #53866)

After pondering a bit more, it seems like CodeProcessor is too generic: this is more something that runs analyses based on annotated tokens, so I'd call it TokenAnalyzer or something. That would also make it obvious to push some of the duplicated functionality from Formatter and Cleaner down (namely everything up to where we have generated the annotated tokens).

I'm still not completely convinced either way regarding inheritance vs. composition, but if we have a TokenAnalyzer, we'd also have a nice is-a relationship. While I generally agree to prefer composition, I also think we need to look out for where inheritance actually makes things clearer.

ioeric updated this revision to Diff 54233.Apr 19 2016, 11:32 AM
ioeric marked 6 inline comments as done.
  • Rebased
  • Make Formatter and Cleaner inherit from TokenAnalyzer (new name for CodeProcessor).
ioeric updated this revision to Diff 54476.Apr 21 2016, 2:48 AM
  • Added comments for endsWithInternal().
djasper added inline comments.Apr 22 2016, 3:20 AM
lib/Format/Format.cpp
1544 ↗(On Diff #54476)

Move this into the base class?

2097 ↗(On Diff #54476)

Instead, create a class Environment that does all of these and makes SourceMgr, ID and CharRanges available via getters (maybe more). Then you should be able to just instantiate an environment and call the other reformat/cleanup function with the corresponding arguments.

ioeric updated this revision to Diff 54837.Apr 25 2016, 5:05 AM
  • Refactored - added Environment class.
ioeric updated this revision to Diff 54848.Apr 25 2016, 7:19 AM
  • Merged VirtualEnvironment into Environment.
klimek accepted this revision.Apr 25 2016, 7:56 AM
klimek added a reviewer: klimek.

We'll probably want Daniel to also take another look over it, as it's a pretty substantial change that will haunt us for a while, but I think this now pretty much looks like I expect it to look.

lib/Format/Format.cpp
1509 ↗(On Diff #54837)

Don't name it SourceMgr, as there is a class of that name. Unfortunately we'll probably want to name it SM, like everywhere else in clang :(

1454 ↗(On Diff #54848)

Won't deleting this line have the same effect?

1489–1490 ↗(On Diff #54848)

There is no base class any more.

1872 ↗(On Diff #54848)

In LLVM TODOs are spelled FIXME.

This revision is now accepted and ready to land.Apr 25 2016, 7:56 AM
ioeric updated this revision to Diff 54855.Apr 25 2016, 8:13 AM
ioeric marked 4 inline comments as done.
ioeric edited edge metadata.
  • Addressed comments.
This revision was automatically updated to reflect the committed changes.

@Daniel, sorry that I forgot to have you look at the final version before
submitting it...

djasper added inline comments.Apr 25 2016, 9:19 PM
cfe/trunk/lib/Format/AffectedRangeManager.h
59

And an empty line between functions and data members here.

66

Fix header guard comment

cfe/trunk/lib/Format/Format.cpp
1355

What happened here?

1463

Hand in CharRanges as ArrayRef or const vector&.

1471

Why don't you do this in the constructor? Seems asymmetric to just call a constructor in the one codepath but a factory function in the other one.

1519

Would it be sufficient to return a const SourceManager&? I guess not, but I'd like to understand where it breaks. If we could return a const source manager here, that would enable us to pass around a "const Environment&" at a few places.

ioeric marked 6 inline comments as done.Apr 27 2016, 6:02 AM
ioeric added inline comments.
cfe/trunk/lib/Format/Format.cpp
1355

This line exceeded 80 characters and was formatted by clang-format when I ran clang-format across it...but I guess this change was out of the scope of this patch...sorry about that.

1471

We have a reference member "SourceManager &SM" that can only be initialized after setting up the environment...I didn't know if constructor would work in this case.

lib/Format/Format.cpp
1544 ↗(On Diff #54476)

The way we calculate affected ranges are the same for Cleaner and Formatter. But in the future, we might want more accurate affected ranges for Cleaner.