clang-format: consider not splitting tokens in optimization
Needs ReviewPublic

Authored by Typz on May 26 2017, 3:11 AM.

Details

Reviewers
krasimir
djasper
Summary

This patch tries to improve the optimizer a bit, to avoid splitting
tokens (e.g. comments/strings) if only there are only few characters
beyond the ColumnLimit.

Previously, comments/strings would be split whenever they went beyond
the ColumnLimit, without consideration for the PenaltyBreakComment/
String and PenaltyExcessCharacter. With this patch, the
OptimizingLineFormatter will also consider not splitting each token,
so that these 'small' offenders get a chance:

// With ColumnLimit=20
int a; // the
       // comment

// With ColumnLimit=20 and PenaltyExcessCharacter = 10
int a; // the comment

This patch does not fully optimize the reflowing, as it will only try
reflowing the whole comment or not reflowing it at all (instead of
trying each split, to allow some lines of overflow ColumnLimit even
when reflowing).

Typz created this revision.May 26 2017, 3:11 AM
Typz updated this revision to Diff 100379.May 26 2017, 3:15 AM

fix indentation issues

Typz added a comment.EditedMay 26 2017, 3:25 AM

This seems to work fine, but may not find the 'optimal' reflow, thus I would need some feedback:

  • Is this approach desirable, as a relatively easy fix?
  • Or should this be fixed with a complete refactoring of the way the strings/comments are split, making multiple tokens out of them to let them be reflown 'directly' by the optimizing line formatter? (keep the optimizer, but changes the whole way comments are handled, and need to move all the information about the comment block and its relfowing into LineState)
  • Or should the breakProtrudingToken() actually perform an optimization of the split of this token? (keeps the same overall structure, but requires writing another optimizer)
unittests/Format/FormatTest.cpp
9838

This is not working as expected, format return:

int a; /* first line
          * second *
          line third
          * line */

Any clue how things could go so wrong?

djasper edited reviewers, added: alexfh; removed: djasper.May 26 2017, 4:46 AM
Typz marked an inline comment as done.May 30 2017, 5:25 AM
Typz added inline comments.
unittests/Format/FormatTest.cpp
9838

This was a bug in tests: should not use test::messUp() with multi-line comments.

Typz updated this revision to Diff 100698.May 30 2017, 5:27 AM
Typz marked an inline comment as done.

fix code & tests

I think that what you're trying to solve is not practically that important, is unlikely to improve the handling of comments, and will add a lot of complexity.

From a usability perspective, I think that people are happy enough when their comments don't exceed the line limit. I personally wouldn't want the opposite to happen. I've even seen style guides that have 80 columns limit for comments and 100 for code.

Comments are treated in a somewhat ad-hoc style in clang-format, which is because they are inherently free text and don't have a specific format. People just expect comments to be handled quite differently than source code. There are at least three separate parts in clang-format that make up the formatting of comments: the normal parsing and optimization pipeline, the BreakableLineCommentSection / BreakableBlockComment classes that are responsible for breaking and reflowing inside comment sections, and the consecutive trailing comment alignment in the WhitespaceManager. This patch takes into account the first aspect but not the consequences for the other aspects: for example it allows for the first line comment token in a multiline line comment section to get out of the column limit, but will reflow the rest of the lines. A WhitespaceManager-related issue is that because a trailing line comment for some declaration might not get split, it might not be aligned with the surrounding trailing line comments, which I think is a less desirable effect.

Optimizing the layout in comment sections by using the optimizing formatter sounds good, but because comments mostly contain free text that can be structured in unexpected ways, I think that the ad-hoc approach works better. Think of not destroying ASCII art and supporting bulleted and numbered lists for example. We don't really want to leak lots of comment-related formatting tweaks into the general pipeline itself.

I see you point that PenaltyBreakComment and PenaltyExcessCharacter are not taken into account while comment placement, but they are taken into account at a higher level by treating the whole comment section as a unit. For example, if you have a long declaration that has multiple potential split points followed by a long trailing comment section, the penalty induced by the number of lines that are needed and the number of unbroken characters for the formatting of the comment section is taken into account while optimizing the layout of the whole code fragment.

The formatted currently supports somewhat limited version of allowing comment lines exceeding the column limit, like long hyperlinks for example. I think that if there are some other examples which are both widespread and super annoying, we may consider them separately.

Typz added a comment.EditedJun 6 2017, 4:59 AM

I think that what you're trying to solve is not practically that important, is unlikely to improve the handling of comments, and will add a lot of complexity.

Not sure the 'approach' I have in this patch is nice, but it does solve my problem: it is quite simple, and avoids reflowing (and actually, sometimes breaking, since comments are not so structured...) the comments.
e.g. if I have a line followed by a relatively long comment, with my patch it will really consider the penaltys properly, and not split a line which is slightly longer [with ColumnLimit=30] :

int a; //< a very long comment

vs

int a; //< a very long
       //< comment

From a usability perspective, I think that people are happy enough when their comments don't exceed the line limit. I personally wouldn't want the opposite to happen. I've even seen style guides that have 80 columns limit for comments and 100 for code.

That is a matter of taste and coding style: I prefer overflowing by a few characters instead of introducing an extra line... I see no reason to allow customizing PenaltyExcessCharacter yet not use this value to make the decision.

Comments are treated in a somewhat ad-hoc style in clang-format, which is because they are inherently free text and don't have a specific format. People just expect comments to be handled quite differently than source code. There are at least three separate parts in clang-format that make up the formatting of comments: the normal parsing and optimization pipeline, the BreakableLineCommentSection / BreakableBlockComment classes that are responsible for breaking and reflowing inside comment sections, and the consecutive trailing comment alignment in the WhitespaceManager. This patch takes into account the first aspect but not the consequences for the other aspects: for example it allows for the first line comment token in a multiline line comment section to get out of the column limit, but will reflow the rest of the lines. A WhitespaceManager-related issue is that because a trailing line comment for some declaration might not get split, it might not be aligned with the surrounding trailing line comments, which I think is a less desirable effect.

Not sure I understand the case you are speaking about, can you please provide an exemple?

(on a separate topic, I could indeed not find a penalty for breaking alignment; I think the optimizer should account for that as well... any hint on where to look for adding this?)

Optimizing the layout in comment sections by using the optimizing formatter sounds good, but because comments mostly contain free text that can be structured in unexpected ways, I think that the ad-hoc approach works better. Think of not destroying ASCII art and supporting bulleted and numbered lists for example. We don't really want to leak lots of comment-related formatting tweaks into the general pipeline itself.

Good, one less choice :-)

And I indeed agree optimizing lines 'individually' may be much worse than the current ad-hoc approach, unless we could depend on extra syntax in comments (like full understanding of markdown and doxygen syntax), but that is out of topic here.

But this patch does try to move away from the adhoc approach: i am just trying to allow *not* wrapping the comments when it gives better score, based on the PenaltyBreakComment and PenaltyExcessCharacter values.

I see you point that PenaltyBreakComment and PenaltyExcessCharacter are not taken into account while comment placement, but they are taken into account at a higher level by treating the whole comment section as a unit. For example, if you have a long declaration that has multiple potential split points followed by a long trailing comment section, the penalty induced by the number of lines that are needed and the number of unbroken characters for the formatting of the comment section is taken into account while optimizing the layout of the whole code fragment.

If you reduce PenaltyExcessCharacter you see that comments can never overflow: the optimizer will consider splitting the comments or splitting the remaining of the code, but will (currently) not allow the comment to overflow just a bit.

The formatted currently supports somewhat limited version of allowing comment lines exceeding the column limit, like long hyperlinks for example. I think that if there are some other examples which are both widespread and super annoying, we may consider them separately.

This patch does not try to address these special cases, and should not change the behavior in this case.

Typz added a comment.Jun 19 2017, 1:34 AM

@krasimir, @alexfh : can I get some feedback?

This patch solves a practical problem, i.e. allowing the comment to overflow a bit without triggering a reflow [according to the priorities which are in config, obviously]. It may not always provide the "best" wrapping, but as you pointed out comments are unstructured, so processing must indeed stay quite conservative: which is how this patch tries to behave.

Is there a better way to do this? or do you see something missing?

There are at least three separate parts in clang-format that make up the formatting of comments: the normal parsing and optimization pipeline, the BreakableLineCommentSection / BreakableBlockComment classes that are responsible for breaking and reflowing inside comment sections, and the consecutive trailing comment alignment in the WhitespaceManager. This patch takes into account the first aspect but not the consequences for the other aspects

You suggested there is an issue, but I could not isolate it. Can you provide an exemple?

alexfh edited reviewers, added: djasper; removed: alexfh.Jun 23 2017, 7:35 AM
alexfh added a subscriber: alexfh.Jun 23 2017, 7:37 AM

Daniel is a better reviewer here than myself. A few cents from me though: why do we want to make an exception for comments and not for regular code?

Typz added a comment.Jun 23 2017, 8:49 AM

why do we want to make an exception for comments and not for regular code?

This is not an exception for comments: the PenaltyExcessCharacter is used whenever the code is longer than the ColumnLimit, and used to compute the best solution.
The default value is extremely high, so there is no practical difference: code does not go beyond the column, except for specific exceptions.

However, the difference becomes apparent when reducing this penalty: in that case, code can break the column limit instead of wrapping, but it does not happen with comments. This patch tries to restore the parity, and let comments break the comment limit as well.

djasper added inline comments.Jul 4 2017, 1:51 AM
lib/Format/UnwrappedLineFormatter.cpp
723 ↗(On Diff #100698)

This is almost certainly a no-go. It turns the algorithm from exploring a state space with a branching factor of 2 into something with a branching factor of 4.

Even assuming we can perfectly reuse results from computations (or in other words hit the same state through different path using Dijkstra's algorithm), the additional bool in StateNode doubles the number of states we are going to visit. You don't even seem to make a distinction of whether the token under investigation can possibly be split. You do look at NoLineBreak(InOperand), but even if those are false, the vast majority of tokens will never be split.

However, even if you narrow that down, I am not convinced that fixing this inconsistency around the formatting penalties is worth the runtime performance hit. I am generally fine with changing the behavior in the way you are proposing, but only if it has zero (negative) effect on performance.

764 ↗(On Diff #100698)

It's not clear to me why you'd return here.

Typz added inline comments.Jul 13 2017, 5:17 AM
lib/Format/UnwrappedLineFormatter.cpp
723 ↗(On Diff #100698)

Making the distinction to skip some path is done at the beginning of addNextStateToQueue(), though indeed not very precisely at the moment.

I can improve the check (i.e. by copying all the early return conditions from the beginning of ContinuationIndenter::breakProtrudingToken(), which will greatly reduce the number of possible state, but stilll have a small impact.

The alternative would be modify ContinuationIndenter::breakProtrudingToken(), so that it computes the penalty for reflowing as well as not-reflowing, and updates the LineState with the best solution. It would certainly have an impact on performance, but would not increase the size of the state space.

Another issue with that approach is that the information is not passed from the DRYRUN phase to the actual rewriting phase. I could store this information in the LineState, to re-use it in the reconstruction phase, but this seems really wrong (and would work only in the exact case of the optimizing line formatter).

What do you think? Should I keep the same structure but reduce the number of explored state; or move the decision into ContinuationIndenter, possibly storing the result in LineState?

764 ↗(On Diff #100698)

This is needed to avoid trying to break when this is not supported: no point doing twice the same thing.

Typz marked 2 inline comments as done.Jul 13 2017, 6:30 AM
Typz added inline comments.
lib/Format/UnwrappedLineFormatter.cpp
723 ↗(On Diff #100698)

Actually, it seems it cannot be done inside ContinuationIndenter::breakProtrudingToken(), since this causes the whitespace manager to be called twice for the same token.

Typz updated this revision to Diff 106442.Jul 13 2017, 9:30 AM

Move code out of optimizer, directly into ContinuationIndenter::breakProtrudingToken(), to minimize impact on performance.
Comment reflowing of breakable items which break the limit will be slightly slower, since we now consider the two options; however this change has no impact on the performance of processing non-breakable items, and may actually increase the operformance of processing breakable items which do not need to be reflowed.

I have a slightly hard time grasping what this patch now actually does? Doesn't it simply try to decide whether or not to make a split locally be comparing the PenaltyBreakComment against the penalty for the access characters? If so, couldn't we simply do that as an implementation detail of breakProtrudingToken() without needing to let anything outside of it now and without introducing State.Reflow?

lib/Format/ContinuationIndenter.cpp
1462

Can you create a patch that doesn't move the code around so much? Seems unnecessary and hard to review.

1569

I believe that this is incorrect. reflowProtrudingToken counts the length of the unbreakable tail and here you just remove the penalty of the token itself. E.g. in:

string s = f("aaa");

the ");" is the unbreakable tail of the stringl

Typz added inline comments.Wed, Sep 13, 7:50 AM
lib/Format/ContinuationIndenter.cpp
1462

Moving the code around is an unfortunate requirement for this patch: we must actually do the "reflowing" twice, to find out which solution (actually reflowing or not) gives the least penalty.

Therefore the function that reflowing must be moved out of breakProtrudingToken()...

All I can do is try moving the new function after breakProtrudingToken(), maybe Phabricator will better show the change.

1569

This behavior has not changed : before the commit, the last token was not included in the penalty [c.f. if at line 1338 in original code].

To make the comparison significative, the last token's penalty is included in the penalty returned by reflowProtrudingToken() (hence the removal of the test at line 1260); and it is removed before returning from this function, to keep the same behavior as before.

Typz updated this revision to Diff 115051.Wed, Sep 13, 8:28 AM

Reorder the functions to minimize diff.

I still don't understand yet. breakProtrudingToken has basically two options:

  1. Don't wrap/reflow: In this case the penalty is determined by the number of excess characters.
  2. Wrap/reflow: I this case the penalty is determined by PenaltySplitComments plus the remaining excess characters.

My question is, why do you need to put anything into the state and do this outside of breakProtrudingToken. It seems to me that breakProtrudingToken can do this locally without putting anything into the State.

Typz added a comment.Tue, Sep 19, 4:57 AM

For one thing, we need to update the state to store the "decision" of the reflowing mode, which is performed only in DryRun=true mode, to avoid doing the computation multiple times.

Apart from this, the decision is conceptually internal to breakProtrudingToken(). But the 'computation' of either mode is very similar, and updates the state variable anyway (State.Column, Stack.back().LastSpace, State.Stack[i].BreakBeforeParameter, Token->updateNextToken(State)) : so it is simpler (code-wise) to split the function and call it twice, then decide which 'State' variable to use.

I think doing the computation twice is fine. Or at least, I'd need a test case where it actually shows substantial overhead before doing what you are doing here. Understand that creating more States and making the State object itself larger also has cost and that cost occurs in the combinatorial exploration of the solution space. Doing an additional computation at the end should be comparatively cheap. Look at it this way: During the exploration of the solution space, we might enter breakProtrudingToken many times for the same comment. One more time during reconstruction of the solution is not that harmful.

Typz added a comment.EditedTue, Sep 19, 5:37 AM

I think doing the computation twice is fine. Or at least, I'd need a test case where it actually shows substantial overhead before doing what you are doing here. Understand that creating more States and making the State object itself larger also has cost and that cost occurs in the combinatorial exploration of the solution space. Doing an additional computation at the end should be comparatively cheap. Look at it this way: During the exploration of the solution space, we might enter breakProtrudingToken many times for the same comment. One more time during reconstruction of the solution is not that harmful.

I 've just tried to implement this (e.g. make the 2 calls also when DryRun=false), and I am running into an assertion in WhitespaceManager :

[ RUN      ] FormatTest.ConfigurableUseOfTab
FormatTests: /workspace/llvm/tools/clang/lib/Format/WhitespaceManager.cpp:112: void clang::format::WhitespaceManager::calculateLineBreakInformation(): Assertion `PreviousOriginalWhitespaceEndOffset <= OriginalWhitespaceStartOffset' failed.

This remind there was another "reason" for limiting this to DryRun (not sure it is a good or bad reason): it happens only in the optimizer, all other cases where the indenter is used are not affected.

I am still trying to get to the bottom of this assertion, any hint where to look for?

Typz added a comment.Tue, Sep 19, 5:50 AM

I am still trying to get to the bottom of this assertion, any hint where to look for?

OK, got it. The issue is that we will actually need to run the wrapping 3 times when DryRun = false : call reflowProtrudingToken() twice with DryRun=true to find out the better option, then call it once more with DryRun=false.

Typz updated this revision to Diff 115830.Tue, Sep 19, 6:06 AM

Remove Reflow from LineState, and perform the decision again during reconstruction phase.

Typz marked 5 inline comments as done.Tue, Sep 19, 6:08 AM

I find the current semantics of the functions a bit surprising, specifically:
... reflowProtrudingToken(..., bool Reflow)
is really confusing me :)

I'd have expected something like this where we currently call breakProtrudingToken():

if (CanBreak) {
  ReflowPenalty = breakProtrudingToken(Current, State, /*DryRun=*/true);
  FixedPenalty = getExcessPenalty(Current, State);
  Penalty = ReflowPenalty < FixedPenalty ? ReflowPenalty : FixedPenalty;
  if (!DryRun && ReflowPenalty < FixedPenalty
      breakProtrudingToken(Current, State, /*DryRun=*/false);
}

I haven't looked how we calculate the penalty currently if we can't break, as if we don't, that also seems ... wrong.
getExcessPenalty would be a new function that just focuses on getting the excess penalty for a breakable token.

Typz added a comment.Wed, Sep 20, 1:03 AM

This cannot be implemented where we currently call breakProtrudingToken(), since this function starts by 'creating' the BreakableToken and dealing with meany corner cases: so this should be done before in any case.
But the code at the end of breakProtrudingToken() is actually very similar to what you propose.

I can refactor the code to have two functions reflowProtrudingToken() and getExcessPenalty(), though that will add some significant redundancy: the problem is that even if we don't actually reflow, we still need (I think?) to handle whitespace replacement and update the state (to set Column, Stack.back().LastSpace, and Stack[i].BreakBeforeParameter, and call updateNextToken() ...). And therefore getExcessPenalty() should probably not just compute the penalty, it should also update the state and handle whitespace replacement...

In D33589#876196, @Typz wrote:

This cannot be implemented where we currently call breakProtrudingToken(), since this function starts by 'creating' the BreakableToken and dealing with meany corner cases: so this should be done before in any case.
But the code at the end of breakProtrudingToken() is actually very similar to what you propose.

I can refactor the code to have two functions reflowProtrudingToken() and getExcessPenalty(), though that will add some significant redundancy: the problem is that even if we don't actually reflow, we still need (I think?) to handle whitespace replacement and update the state (to set Column, Stack.back().LastSpace, and Stack[i].BreakBeforeParameter, and call updateNextToken() ...). And therefore getExcessPenalty() should probably not just compute the penalty, it should also update the state and handle whitespace replacement...

My question is: if CanBreak is false, we currently don't call breakProtrudingToken. So either we do something very wrong in that case (which might be true, but I'd like to understand why) or we should be able to just calculate the penalty by not breaking anything and go on.