This is an archive of the discontinued LLVM Phabricator instance.

clang-format: consider not splitting tokens in optimization
AbandonedPublic

Authored by Typz on May 26 2017, 3:11 AM.

Details

Reviewers
krasimir
djasper
Summary

This patch tries to improve the optimizer a bit, to avoid splitting
tokens (e.g. comments/strings) if only there are only few characters
beyond the ColumnLimit.

Previously, comments/strings would be split whenever they went beyond
the ColumnLimit, without consideration for the PenaltyBreakComment/
String and PenaltyExcessCharacter. With this patch, the
OptimizingLineFormatter will also consider not splitting each token,
so that these 'small' offenders get a chance:

// With ColumnLimit=20
int a; // the
       // comment

// With ColumnLimit=20 and PenaltyExcessCharacter = 10
int a; // the comment

This patch does not fully optimize the reflowing, as it will only try
reflowing the whole comment or not reflowing it at all (instead of
trying each split, to allow some lines of overflow ColumnLimit even
when reflowing).

Event Timeline

Typz created this revision.May 26 2017, 3:11 AM
Typz updated this revision to Diff 100379.May 26 2017, 3:15 AM

fix indentation issues

Typz added a comment.EditedMay 26 2017, 3:25 AM

This seems to work fine, but may not find the 'optimal' reflow, thus I would need some feedback:

  • Is this approach desirable, as a relatively easy fix?
  • Or should this be fixed with a complete refactoring of the way the strings/comments are split, making multiple tokens out of them to let them be reflown 'directly' by the optimizing line formatter? (keep the optimizer, but changes the whole way comments are handled, and need to move all the information about the comment block and its relfowing into LineState)
  • Or should the breakProtrudingToken() actually perform an optimization of the split of this token? (keeps the same overall structure, but requires writing another optimizer)
unittests/Format/FormatTest.cpp
8571

This is not working as expected, format return:

int a; /* first line
          * second *
          line third
          * line */

Any clue how things could go so wrong?

djasper edited reviewers, added: alexfh; removed: djasper.May 26 2017, 4:46 AM
Typz marked an inline comment as done.May 30 2017, 5:25 AM
Typz added inline comments.
unittests/Format/FormatTest.cpp
8571

This was a bug in tests: should not use test::messUp() with multi-line comments.

Typz updated this revision to Diff 100698.May 30 2017, 5:27 AM
Typz marked an inline comment as done.

fix code & tests

krasimir edited edge metadata.Jun 1 2017, 3:55 AM

I think that what you're trying to solve is not practically that important, is unlikely to improve the handling of comments, and will add a lot of complexity.

From a usability perspective, I think that people are happy enough when their comments don't exceed the line limit. I personally wouldn't want the opposite to happen. I've even seen style guides that have 80 columns limit for comments and 100 for code.

Comments are treated in a somewhat ad-hoc style in clang-format, which is because they are inherently free text and don't have a specific format. People just expect comments to be handled quite differently than source code. There are at least three separate parts in clang-format that make up the formatting of comments: the normal parsing and optimization pipeline, the BreakableLineCommentSection / BreakableBlockComment classes that are responsible for breaking and reflowing inside comment sections, and the consecutive trailing comment alignment in the WhitespaceManager. This patch takes into account the first aspect but not the consequences for the other aspects: for example it allows for the first line comment token in a multiline line comment section to get out of the column limit, but will reflow the rest of the lines. A WhitespaceManager-related issue is that because a trailing line comment for some declaration might not get split, it might not be aligned with the surrounding trailing line comments, which I think is a less desirable effect.

Optimizing the layout in comment sections by using the optimizing formatter sounds good, but because comments mostly contain free text that can be structured in unexpected ways, I think that the ad-hoc approach works better. Think of not destroying ASCII art and supporting bulleted and numbered lists for example. We don't really want to leak lots of comment-related formatting tweaks into the general pipeline itself.

I see you point that PenaltyBreakComment and PenaltyExcessCharacter are not taken into account while comment placement, but they are taken into account at a higher level by treating the whole comment section as a unit. For example, if you have a long declaration that has multiple potential split points followed by a long trailing comment section, the penalty induced by the number of lines that are needed and the number of unbroken characters for the formatting of the comment section is taken into account while optimizing the layout of the whole code fragment.

The formatted currently supports somewhat limited version of allowing comment lines exceeding the column limit, like long hyperlinks for example. I think that if there are some other examples which are both widespread and super annoying, we may consider them separately.

Typz added a comment.EditedJun 6 2017, 4:59 AM

I think that what you're trying to solve is not practically that important, is unlikely to improve the handling of comments, and will add a lot of complexity.

Not sure the 'approach' I have in this patch is nice, but it does solve my problem: it is quite simple, and avoids reflowing (and actually, sometimes breaking, since comments are not so structured...) the comments.
e.g. if I have a line followed by a relatively long comment, with my patch it will really consider the penaltys properly, and not split a line which is slightly longer [with ColumnLimit=30] :

int a; //< a very long comment

vs

int a; //< a very long
       //< comment

From a usability perspective, I think that people are happy enough when their comments don't exceed the line limit. I personally wouldn't want the opposite to happen. I've even seen style guides that have 80 columns limit for comments and 100 for code.

That is a matter of taste and coding style: I prefer overflowing by a few characters instead of introducing an extra line... I see no reason to allow customizing PenaltyExcessCharacter yet not use this value to make the decision.

Comments are treated in a somewhat ad-hoc style in clang-format, which is because they are inherently free text and don't have a specific format. People just expect comments to be handled quite differently than source code. There are at least three separate parts in clang-format that make up the formatting of comments: the normal parsing and optimization pipeline, the BreakableLineCommentSection / BreakableBlockComment classes that are responsible for breaking and reflowing inside comment sections, and the consecutive trailing comment alignment in the WhitespaceManager. This patch takes into account the first aspect but not the consequences for the other aspects: for example it allows for the first line comment token in a multiline line comment section to get out of the column limit, but will reflow the rest of the lines. A WhitespaceManager-related issue is that because a trailing line comment for some declaration might not get split, it might not be aligned with the surrounding trailing line comments, which I think is a less desirable effect.

Not sure I understand the case you are speaking about, can you please provide an exemple?

(on a separate topic, I could indeed not find a penalty for breaking alignment; I think the optimizer should account for that as well... any hint on where to look for adding this?)

Optimizing the layout in comment sections by using the optimizing formatter sounds good, but because comments mostly contain free text that can be structured in unexpected ways, I think that the ad-hoc approach works better. Think of not destroying ASCII art and supporting bulleted and numbered lists for example. We don't really want to leak lots of comment-related formatting tweaks into the general pipeline itself.

Good, one less choice :-)

And I indeed agree optimizing lines 'individually' may be much worse than the current ad-hoc approach, unless we could depend on extra syntax in comments (like full understanding of markdown and doxygen syntax), but that is out of topic here.

But this patch does try to move away from the adhoc approach: i am just trying to allow *not* wrapping the comments when it gives better score, based on the PenaltyBreakComment and PenaltyExcessCharacter values.

I see you point that PenaltyBreakComment and PenaltyExcessCharacter are not taken into account while comment placement, but they are taken into account at a higher level by treating the whole comment section as a unit. For example, if you have a long declaration that has multiple potential split points followed by a long trailing comment section, the penalty induced by the number of lines that are needed and the number of unbroken characters for the formatting of the comment section is taken into account while optimizing the layout of the whole code fragment.

If you reduce PenaltyExcessCharacter you see that comments can never overflow: the optimizer will consider splitting the comments or splitting the remaining of the code, but will (currently) not allow the comment to overflow just a bit.

The formatted currently supports somewhat limited version of allowing comment lines exceeding the column limit, like long hyperlinks for example. I think that if there are some other examples which are both widespread and super annoying, we may consider them separately.

This patch does not try to address these special cases, and should not change the behavior in this case.

Typz added a comment.Jun 19 2017, 1:34 AM

@krasimir, @alexfh : can I get some feedback?

This patch solves a practical problem, i.e. allowing the comment to overflow a bit without triggering a reflow [according to the priorities which are in config, obviously]. It may not always provide the "best" wrapping, but as you pointed out comments are unstructured, so processing must indeed stay quite conservative: which is how this patch tries to behave.

Is there a better way to do this? or do you see something missing?

There are at least three separate parts in clang-format that make up the formatting of comments: the normal parsing and optimization pipeline, the BreakableLineCommentSection / BreakableBlockComment classes that are responsible for breaking and reflowing inside comment sections, and the consecutive trailing comment alignment in the WhitespaceManager. This patch takes into account the first aspect but not the consequences for the other aspects

You suggested there is an issue, but I could not isolate it. Can you provide an exemple?

alexfh edited reviewers, added: djasper; removed: alexfh.Jun 23 2017, 7:35 AM
alexfh added a subscriber: alexfh.Jun 23 2017, 7:37 AM

Daniel is a better reviewer here than myself. A few cents from me though: why do we want to make an exception for comments and not for regular code?

Typz added a comment.Jun 23 2017, 8:49 AM

why do we want to make an exception for comments and not for regular code?

This is not an exception for comments: the PenaltyExcessCharacter is used whenever the code is longer than the ColumnLimit, and used to compute the best solution.
The default value is extremely high, so there is no practical difference: code does not go beyond the column, except for specific exceptions.

However, the difference becomes apparent when reducing this penalty: in that case, code can break the column limit instead of wrapping, but it does not happen with comments. This patch tries to restore the parity, and let comments break the comment limit as well.

djasper added inline comments.Jul 4 2017, 1:51 AM
lib/Format/UnwrappedLineFormatter.cpp
723

This is almost certainly a no-go. It turns the algorithm from exploring a state space with a branching factor of 2 into something with a branching factor of 4.

Even assuming we can perfectly reuse results from computations (or in other words hit the same state through different path using Dijkstra's algorithm), the additional bool in StateNode doubles the number of states we are going to visit. You don't even seem to make a distinction of whether the token under investigation can possibly be split. You do look at NoLineBreak(InOperand), but even if those are false, the vast majority of tokens will never be split.

However, even if you narrow that down, I am not convinced that fixing this inconsistency around the formatting penalties is worth the runtime performance hit. I am generally fine with changing the behavior in the way you are proposing, but only if it has zero (negative) effect on performance.

764

It's not clear to me why you'd return here.

Typz added inline comments.Jul 13 2017, 5:17 AM
lib/Format/UnwrappedLineFormatter.cpp
723

Making the distinction to skip some path is done at the beginning of addNextStateToQueue(), though indeed not very precisely at the moment.

I can improve the check (i.e. by copying all the early return conditions from the beginning of ContinuationIndenter::breakProtrudingToken(), which will greatly reduce the number of possible state, but stilll have a small impact.

The alternative would be modify ContinuationIndenter::breakProtrudingToken(), so that it computes the penalty for reflowing as well as not-reflowing, and updates the LineState with the best solution. It would certainly have an impact on performance, but would not increase the size of the state space.

Another issue with that approach is that the information is not passed from the DRYRUN phase to the actual rewriting phase. I could store this information in the LineState, to re-use it in the reconstruction phase, but this seems really wrong (and would work only in the exact case of the optimizing line formatter).

What do you think? Should I keep the same structure but reduce the number of explored state; or move the decision into ContinuationIndenter, possibly storing the result in LineState?

764

This is needed to avoid trying to break when this is not supported: no point doing twice the same thing.

Typz marked 2 inline comments as done.Jul 13 2017, 6:30 AM
Typz added inline comments.
lib/Format/UnwrappedLineFormatter.cpp
723

Actually, it seems it cannot be done inside ContinuationIndenter::breakProtrudingToken(), since this causes the whitespace manager to be called twice for the same token.

Typz updated this revision to Diff 106442.Jul 13 2017, 9:30 AM

Move code out of optimizer, directly into ContinuationIndenter::breakProtrudingToken(), to minimize impact on performance.
Comment reflowing of breakable items which break the limit will be slightly slower, since we now consider the two options; however this change has no impact on the performance of processing non-breakable items, and may actually increase the operformance of processing breakable items which do not need to be reflowed.

djasper edited edge metadata.Sep 12 2017, 2:13 AM

I have a slightly hard time grasping what this patch now actually does? Doesn't it simply try to decide whether or not to make a split locally be comparing the PenaltyBreakComment against the penalty for the access characters? If so, couldn't we simply do that as an implementation detail of breakProtrudingToken() without needing to let anything outside of it now and without introducing State.Reflow?

lib/Format/ContinuationIndenter.cpp
1330

Can you create a patch that doesn't move the code around so much? Seems unnecessary and hard to review.

1437

I believe that this is incorrect. reflowProtrudingToken counts the length of the unbreakable tail and here you just remove the penalty of the token itself. E.g. in:

string s = f("aaa");

the ");" is the unbreakable tail of the stringl

Typz added inline comments.Sep 13 2017, 7:50 AM
lib/Format/ContinuationIndenter.cpp
1330

Moving the code around is an unfortunate requirement for this patch: we must actually do the "reflowing" twice, to find out which solution (actually reflowing or not) gives the least penalty.

Therefore the function that reflowing must be moved out of breakProtrudingToken()...

All I can do is try moving the new function after breakProtrudingToken(), maybe Phabricator will better show the change.

1437

This behavior has not changed : before the commit, the last token was not included in the penalty [c.f. if at line 1338 in original code].

To make the comparison significative, the last token's penalty is included in the penalty returned by reflowProtrudingToken() (hence the removal of the test at line 1260); and it is removed before returning from this function, to keep the same behavior as before.

Typz updated this revision to Diff 115051.Sep 13 2017, 8:28 AM

Reorder the functions to minimize diff.

I still don't understand yet. breakProtrudingToken has basically two options:

  1. Don't wrap/reflow: In this case the penalty is determined by the number of excess characters.
  2. Wrap/reflow: I this case the penalty is determined by PenaltySplitComments plus the remaining excess characters.

My question is, why do you need to put anything into the state and do this outside of breakProtrudingToken. It seems to me that breakProtrudingToken can do this locally without putting anything into the State.

Typz added a comment.Sep 19 2017, 4:57 AM

For one thing, we need to update the state to store the "decision" of the reflowing mode, which is performed only in DryRun=true mode, to avoid doing the computation multiple times.

Apart from this, the decision is conceptually internal to breakProtrudingToken(). But the 'computation' of either mode is very similar, and updates the state variable anyway (State.Column, Stack.back().LastSpace, State.Stack[i].BreakBeforeParameter, Token->updateNextToken(State)) : so it is simpler (code-wise) to split the function and call it twice, then decide which 'State' variable to use.

I think doing the computation twice is fine. Or at least, I'd need a test case where it actually shows substantial overhead before doing what you are doing here. Understand that creating more States and making the State object itself larger also has cost and that cost occurs in the combinatorial exploration of the solution space. Doing an additional computation at the end should be comparatively cheap. Look at it this way: During the exploration of the solution space, we might enter breakProtrudingToken many times for the same comment. One more time during reconstruction of the solution is not that harmful.

Typz added a comment.EditedSep 19 2017, 5:37 AM

I think doing the computation twice is fine. Or at least, I'd need a test case where it actually shows substantial overhead before doing what you are doing here. Understand that creating more States and making the State object itself larger also has cost and that cost occurs in the combinatorial exploration of the solution space. Doing an additional computation at the end should be comparatively cheap. Look at it this way: During the exploration of the solution space, we might enter breakProtrudingToken many times for the same comment. One more time during reconstruction of the solution is not that harmful.

I 've just tried to implement this (e.g. make the 2 calls also when DryRun=false), and I am running into an assertion in WhitespaceManager :

[ RUN      ] FormatTest.ConfigurableUseOfTab
FormatTests: /workspace/llvm/tools/clang/lib/Format/WhitespaceManager.cpp:112: void clang::format::WhitespaceManager::calculateLineBreakInformation(): Assertion `PreviousOriginalWhitespaceEndOffset <= OriginalWhitespaceStartOffset' failed.

This remind there was another "reason" for limiting this to DryRun (not sure it is a good or bad reason): it happens only in the optimizer, all other cases where the indenter is used are not affected.

I am still trying to get to the bottom of this assertion, any hint where to look for?

Typz added a comment.Sep 19 2017, 5:50 AM

I am still trying to get to the bottom of this assertion, any hint where to look for?

OK, got it. The issue is that we will actually need to run the wrapping 3 times when DryRun = false : call reflowProtrudingToken() twice with DryRun=true to find out the better option, then call it once more with DryRun=false.

Typz updated this revision to Diff 115830.Sep 19 2017, 6:06 AM

Remove Reflow from LineState, and perform the decision again during reconstruction phase.

Typz marked 5 inline comments as done.Sep 19 2017, 6:08 AM

I find the current semantics of the functions a bit surprising, specifically:
... reflowProtrudingToken(..., bool Reflow)
is really confusing me :)

I'd have expected something like this where we currently call breakProtrudingToken():

if (CanBreak) {
  ReflowPenalty = breakProtrudingToken(Current, State, /*DryRun=*/true);
  FixedPenalty = getExcessPenalty(Current, State);
  Penalty = ReflowPenalty < FixedPenalty ? ReflowPenalty : FixedPenalty;
  if (!DryRun && ReflowPenalty < FixedPenalty
      breakProtrudingToken(Current, State, /*DryRun=*/false);
}

I haven't looked how we calculate the penalty currently if we can't break, as if we don't, that also seems ... wrong.
getExcessPenalty would be a new function that just focuses on getting the excess penalty for a breakable token.

Typz added a comment.Sep 20 2017, 1:03 AM

This cannot be implemented where we currently call breakProtrudingToken(), since this function starts by 'creating' the BreakableToken and dealing with meany corner cases: so this should be done before in any case.
But the code at the end of breakProtrudingToken() is actually very similar to what you propose.

I can refactor the code to have two functions reflowProtrudingToken() and getExcessPenalty(), though that will add some significant redundancy: the problem is that even if we don't actually reflow, we still need (I think?) to handle whitespace replacement and update the state (to set Column, Stack.back().LastSpace, and Stack[i].BreakBeforeParameter, and call updateNextToken() ...). And therefore getExcessPenalty() should probably not just compute the penalty, it should also update the state and handle whitespace replacement...

In D33589#876196, @Typz wrote:

This cannot be implemented where we currently call breakProtrudingToken(), since this function starts by 'creating' the BreakableToken and dealing with meany corner cases: so this should be done before in any case.
But the code at the end of breakProtrudingToken() is actually very similar to what you propose.

I can refactor the code to have two functions reflowProtrudingToken() and getExcessPenalty(), though that will add some significant redundancy: the problem is that even if we don't actually reflow, we still need (I think?) to handle whitespace replacement and update the state (to set Column, Stack.back().LastSpace, and Stack[i].BreakBeforeParameter, and call updateNextToken() ...). And therefore getExcessPenalty() should probably not just compute the penalty, it should also update the state and handle whitespace replacement...

My question is: if CanBreak is false, we currently don't call breakProtrudingToken. So either we do something very wrong in that case (which might be true, but I'd like to understand why) or we should be able to just calculate the penalty by not breaking anything and go on.

Typz added a comment.Oct 24 2017, 9:16 AM

My question is: if CanBreak is false, we currently don't call breakProtrudingToken. So either we do something very wrong in that case (which might be true, but I'd like to understand why) or we should be able to just calculate the penalty by not breaking anything and go on.

CanBreak is currently very short, it only verifies some very broad conditions. I initiallly tried patching at this level, but it really does not work.

Most conditions are actually tested at the beginning of breakProtrudingToken. That part of the code actually takes different branches for different kind of tokens (strings, block commands, line comments...), and handles many different cases for each. This goal is to create the actual BreakableToken object, but it has a few other side effects: it returns immediately in various corner cases (javascript, preprocessor, formatting macros....), and it tweaks some variables, e.g. ColumnLimit.

In addition, replacement is non trivial even in case we choose not to reflow: it is required in order to properly manage whitespaces. For this reason we really must pass through the loop in most cases.

In D33589#920160, @Typz wrote:

ping ?

I'm working on understanding this better :) I've refactored the code a bit so I could fully understand the problem, which I now do (sorry for this taking a while, but it took me multiple hours to work through this). I'm now convinced that you're right that we need the reflow/breaking logic to make the decision about the trade-offs, as we need the penalty for the potentially reflown code, as opposed to the penalty for the code if we do nothing.
Given that, my current gut feeling is that we'll want to go a bit further in that direction and make the change more local in breakProtrudingToken. I'll give that a spin and report back with what I find.

One interesting trade-off I'm running into:
My gut feeling is that we really want to make local decisions about whether we want to break/reflow - this makes the code significantly simpler (IMO), and handles all tests in this patch correctly, but is fundamentally limiting the global optimizations we can do. Specifically, we would not correctly reflow this:

//       |< limit
// foo bar
// baz
// x

to

// foo
// bar
// baz x

when the excess character limit is low.

That would be a point for global optimization, but I find it really hard to understand exactly why it's ok to do it. Won't we get things like this wrong:

Limit: 13
// foo  bar baz
// bab      bob

as we'll not compress whitespace?

I think this patch doesn't handle a couple of cases that I'd like to see handled. A counter-proposal with different trade-offs is in D40068.

Typz added a comment.Nov 21 2017, 7:40 AM

One interesting trade-off I'm running into:
My gut feeling is that we really want to make local decisions about whether we want to break/reflow - this makes the code significantly simpler (IMO), and handles all tests in this patch correctly, but is fundamentally limiting the global optimizations we can do. Specifically, we would not correctly reflow this:

//       |< limit
// foo bar
// baz
// x

to

// foo
// bar
// baz x

when the excess character limit is low.

As I can see with your patch, local decision does not account for accumulated penalty on multi-line comment, and will thus give unexpected (e.g. no change) result when each line overlaps by a few characters, but not enough to trigger a break at this line.

That would be a point for global optimization, but I find it really hard to understand exactly why it's ok to do it. Won't we get things like this wrong:

Limit: 13
// foo  bar baz
// bab      bob

as we'll not compress whitespace?

Indeed, this patch would not trigger whitespace compression when not reflowing; it would compare "not doing anything" (no reflow, no whitespace compression) with the complete reflowing (including whitespace compression). I don't think that would break anything, but indeed we could possibly get even better result by trying to apply whitespace compression in the no-reflow case [which should be simple, just a bit more code at line 1376 in the version of the patch].

Typz added a comment.EditedNov 21 2017, 7:46 AM

I think this patch doesn't handle a couple of cases that I'd like to see handled. A counter-proposal with different trade-offs is in D40068.

Can you provide more info on these cases? Are they all added to the tests of D40068?

It may be simpler (though not to my eyes, I am not knowledgeable enough to really understand how you go this fixed...), and works fine for "almost correct" comments: e.g. when there are indeed just a few extra characters overall. But it still procudes strange result when each line of the (long) comment is too long, but not enough to trigger a line-wrap by itself.

Since that version has landed already, not sure how to improve on this. I could probably rewrite my patch on master, but it seems a bit redundant. As a simpler fix, I could imagine adding a "total" overflow counter, to allow detecting the situation; but when this is detected (e.g. on subsequent lines) we would need to "backtrack" and revisit the initial decision...

Typz added a comment.Nov 21 2017, 8:07 AM

Btw, another issue I am having is that reflowing does not respect the alignment. For exemple:

enum {
   Foo,    ///< This is a very long comment
   Bar,    ///< This is shorter  
   BarBar, ///< This is shorter
} Stuff;

will be reflown to :

enum {
   Foo, ///< This is a very long
        ///< comment
   Bar, ///< This is shorter  
   BarBar, ///< This is shorter
} Stuff;

when I would expect:

enum {
   Foo,    ///< This is a very
           ///< long comment
   Bar,    ///< This is shorter  
   BarBar, ///< This is shorter
} Stuff;

I see there is no penalty for breaking alignment, which may be accounted for when compressing the whitespace 'before' the comment... Or maybe simply the breaking should be smarter, to try to keep the alignment; but I am not sure it can be done without another kind of 'global' optimization of the comment reflow... Any idea/hint?

In D33589#931723, @Typz wrote:

I think this patch doesn't handle a couple of cases that I'd like to see handled. A counter-proposal with different trade-offs is in D40068.

Can you provide more info on these cases? Are they all added to the tests of D40068?

It may be simpler (though not to my eyes, I am not knowledgeable enough to really understand how you go this fixed...), and works fine for "almost correct" comments: e.g. when there are indeed just a few extra characters overall. But it still procudes strange result when each line of the (long) comment is too long, but not enough to trigger a line-wrap by itself.

Since that version has landed already, not sure how to improve on this. I could probably rewrite my patch on master, but it seems a bit redundant. As a simpler fix, I could imagine adding a "total" overflow counter, to allow detecting the situation; but when this is detected (e.g. on subsequent lines) we would need to "backtrack" and revisit the initial decision...

Yes, I added all tests I was thinking of to the patch. Note that we can always go back on submitted patches / do things differently. As my patch fulfilled all your tests, I (perhaps incorrectly?) assumed it solved your use cases - can you give me a test case of what you would want to happen that doesn't happen with my patch?

Are you, for example, saying that in the case

Limit; 13
// foo bar baz foo bar baz foo bar baz
you'd not want
// foo bar baz
// foo bar baz
// foo bar baz
If the overall penalty of the protruding tokens is - what? More than 1 break penalty? If you care about more than 3x break penalty (which I would think is correct), the local algorithm will work, as local decisions will make sure the overall penalty cannot be exceeded.
In D33589#931802, @Typz wrote:

Btw, another issue I am having is that reflowing does not respect the alignment. For exemple:

enum {
   Foo,    ///< This is a very long comment
   Bar,    ///< This is shorter  
   BarBar, ///< This is shorter
} Stuff;

will be reflown to :

enum {
   Foo, ///< This is a very long
        ///< comment
   Bar, ///< This is shorter  
   BarBar, ///< This is shorter
} Stuff;

when I would expect:

enum {
   Foo,    ///< This is a very
           ///< long comment
   Bar,    ///< This is shorter  
   BarBar, ///< This is shorter
} Stuff;

I see there is no penalty for breaking alignment, which may be accounted for when compressing the whitespace 'before' the comment... Or maybe simply the breaking should be smarter, to try to keep the alignment; but I am not sure it can be done without another kind of 'global' optimization of the comment reflow... Any idea/hint?

I think that's just a bug in comment alignment, which is done at a different stage.

In D33589#931723, @Typz wrote:

I think this patch doesn't handle a couple of cases that I'd like to see handled. A counter-proposal with different trade-offs is in D40068.

Can you provide more info on these cases? Are they all added to the tests of D40068?

It may be simpler (though not to my eyes, I am not knowledgeable enough to really understand how you go this fixed...), and works fine for "almost correct" comments: e.g. when there are indeed just a few extra characters overall. But it still procudes strange result when each line of the (long) comment is too long, but not enough to trigger a line-wrap by itself.

Since that version has landed already, not sure how to improve on this. I could probably rewrite my patch on master, but it seems a bit redundant. As a simpler fix, I could imagine adding a "total" overflow counter, to allow detecting the situation; but when this is detected (e.g. on subsequent lines) we would need to "backtrack" and revisit the initial decision...

Yes, I added all tests I was thinking of to the patch. Note that we can always go back on submitted patches / do things differently. As my patch fulfilled all your tests, I (perhaps incorrectly?) assumed it solved your use cases - can you give me a test case of what you would want to happen that doesn't happen with my patch?

Are you, for example, saying that in the case

Limit; 13
// foo bar baz foo bar baz foo bar baz
you'd not want
// foo bar baz
// foo bar baz
// foo bar baz
If the overall penalty of the protruding tokens is - what? More than 1 break penalty? If you care about more than 3x break penalty (which I would think is correct), the local algorithm will work, as local decisions will make sure the overall penalty cannot be exceeded.

Ok, sorry, after reading the comment on the other patch I get it :)
In the above example, we add 3 line breaks, and we'd add 1 (or more) additional line breaks when reflowing below the column limit.
I agree that that can lead to different overall outcomes, but I don't see how the approach of this patch really fixes it - it will only ever reflow below the column limit, so it'll also lead to states for long lines where reflowing and leaving chars over the line limit might be the overall best choice (unless I'm missing something)?

Typz added a comment.Nov 23 2017, 6:35 AM

@klimek wrote:
In the above example, we add 3 line breaks, and we'd add 1 (or more) additional line breaks when reflowing below the column limit.
I agree that that can lead to different overall outcomes, but I don't see how the approach of this patch really fixes it - it will only ever reflow below the column limit, so it'll also lead to states for long lines where reflowing and leaving chars over the line limit might be the overall best choice (unless I'm missing something)?

Definitely, this patch may not find the 'optimal' solution. What I mean is that we reduce the PenaltyExcessCharacter value to allow "occasionally" breaking the limit: instead of a hard limit, we want to allow lines to sometimes break the limit, but definitely not *all* the time. Both patch work fine when the code is "correct", i.e. there is indeed only a few lines which break the limit.

But the local decision approach behaves really wrong IMHO when the code is formatted beyond the column: it can very easily reformat in such a way that the comment is reflown to what looks like a longer column limit. I currently have a ratio of 10 between PenaltyBreakComment and PenaltyExcessCharacter (which empirically seemed to give a decent compromise, and match how our code is formatted; I need to try more with your patch, to see if we can get better values...): with this setting, a "non wrapped" comment will actually be reflown to ColumnLimit+10...

When we do indeed reflow, I think we may be stricter than this, to get something that really looks like it obeys the column limit. If this is 'optimal' in the sense that we may have some overflow still, that is fine, but really not the primary requirement IMHO.

In D33589#933746, @Typz wrote:

@klimek wrote:
In the above example, we add 3 line breaks, and we'd add 1 (or more) additional line breaks when reflowing below the column limit.
I agree that that can lead to different overall outcomes, but I don't see how the approach of this patch really fixes it - it will only ever reflow below the column limit, so it'll also lead to states for long lines where reflowing and leaving chars over the line limit might be the overall best choice (unless I'm missing something)?

Definitely, this patch may not find the 'optimal' solution. What I mean is that we reduce the PenaltyExcessCharacter value to allow "occasionally" breaking the limit: instead of a hard limit, we want to allow lines to sometimes break the limit, but definitely not *all* the time. Both patch work fine when the code is "correct", i.e. there is indeed only a few lines which break the limit.

But the local decision approach behaves really wrong IMHO when the code is formatted beyond the column: it can very easily reformat in such a way that the comment is reflown to what looks like a longer column limit. I currently have a ratio of 10 between PenaltyBreakComment and PenaltyExcessCharacter (which empirically seemed to give a decent compromise, and match how our code is formatted; I need to try more with your patch, to see if we can get better values...): with this setting, a "non wrapped" comment will actually be reflown to ColumnLimit+10...

To me the reverse intuitively makes sense:
When I see that
first line
second line
third line
stays the same, but
first line second line third line
gets reflown into something different, I get confused :)

The only way I see to consistently solve this is to optimize the reflow breaks like we do the main algorithm.

When we do indeed reflow, I think we may be stricter than this, to get something that really looks like it obeys the column limit. If this is 'optimal' in the sense that we may have some overflow still, that is fine, but really not the primary requirement IMHO.

In D33589#933746, @Typz wrote:

with this setting, a "non wrapped" comment will actually be reflown to ColumnLimit+10...

Isn't the same true for code then, though? Generally, code will protrude by 10 columns before being broken?

Typz added a comment.Dec 1 2017, 5:50 AM

I think the difference between code and comments is that code "words" are easily 10 characters or more, whereas actual words (in comments) are very often less than 10 characters: so code overflowing by 10 characters is not very frequent. whereas small words in comment will often get closer to the "extra" limit.

That said, I tried with your latest change ("Restructure how we break tokens", sha1:64d42a2fb85ece5987111ffb908c6bc7f7431dd4). and it's working about fine now. For the most part it seems to wrap when I would expect it, great work!
I have seen 2 "issues" though:

  • Often I see that the last word before reflowing is not wrapped (eventhough it overlaps the line length); I did not count penalties so I cannot confirm this is really an issue or just a borderline scenario.
  • Alignment seems better than before, but since there is no penalty for breaking alignment it will always try to unindent to compensate for overflowing characters...

Seeing this, I guess this patch does not make much sense anymore, I'll see if I make some improvements for these two issues, in separate patches.

Typz abandoned this revision.Dec 1 2017, 5:50 AM
klimek added a comment.Dec 1 2017, 5:54 AM
In D33589#941979, @Typz wrote:

I think the difference between code and comments is that code "words" are easily 10 characters or more, whereas actual words (in comments) are very often less than 10 characters: so code overflowing by 10 characters is not very frequent. whereas small words in comment will often get closer to the "extra" limit.

That said, I tried with your latest change ("Restructure how we break tokens", sha1:64d42a2fb85ece5987111ffb908c6bc7f7431dd4). and it's working about fine now. For the most part it seems to wrap when I would expect it, great work!
I have seen 2 "issues" though:

  • Often I see that the last word before reflowing is not wrapped (eventhough it overlaps the line length); I did not count penalties so I cannot confirm this is really an issue or just a borderline scenario.
  • Alignment seems better than before, but since there is no penalty for breaking alignment it will always try to unindent to compensate for overflowing characters...

Seeing this, I guess this patch does not make much sense anymore, I'll see if I make some improvements for these two issues, in separate patches.

Note that I just 10 mins ago landed another change (r319541) that should fix the main issue you raised, and which looks a lot like I wanted this change to look back when I first saw it (but of course all the underlying code made that impossible). Please give it a try and let me know what you think.

Typz added a comment.Dec 1 2017, 7:16 AM

Indeed, looks good, thanks!

Though that exacerbates the alignment issue, I now get things like this:

enum {
  SomeLongerEnum, // comment
  SomeThing,      // comment
  Foo, // something
} 
                            ^ (column limit)

The comment on 'Foo' would overflow a bit, but it gets unindented during "alingment" stage, which kind of 'breaks' the decision that was made earlier on *not* to break the comment...

klimek added a comment.Dec 1 2017, 7:47 AM
In D33589#942128, @Typz wrote:

Indeed, looks good, thanks!

Though that exacerbates the alignment issue, I now get things like this:

enum {
  SomeLongerEnum, // comment
  SomeThing,      // comment
  Foo, // something
} 
                            ^ (column limit)

The comment on 'Foo' would overflow a bit, but it gets unindented during "alingment" stage, which kind of 'breaks' the decision that was made earlier on *not* to break the comment...

Ok, that seems like a different bug that we can fix in alignment, though :)