This is an archive of the discontinued LLVM Phabricator instance.

Restructure how we break tokens.
ClosedPublic

Authored by klimek on Nov 21 2017, 8:00 AM.

Details

Summary

This fixes some bugs in the reflowing logic and splits out the concerns
of reflowing from BreakableToken.

Things to do after this patch:

  • Refactor the breakProtrudingToken function possibly into a class, so we can split it up into methods that operate on the common state.
  • Optimize whitespace compression when reflowing by using the next possible split point instead of the latest possible split point.

Event Timeline

klimek created this revision.Nov 21 2017, 8:00 AM
krasimir edited edge metadata.Nov 22 2017, 3:27 AM

Here're a few nits. I'll need an evening to review the meaty part :)

lib/Format/BreakableToken.cpp
73

nit: no braces around single-statement if body.

196

How about clients that explicitly pass Length = Line.size() - Offset?

484

Could you elaborate?

485

nit: postfix.

577

nit: af -> at

581–582

nit: no braces around single-statement if bodies.

861

could you give an example of what's a non-regular indent?

lib/Format/BreakableToken.h
52

Replace are with is.

100

Does this include the byte Offset + Length?

118–136

text is ambiguous here: does it refer to the content of the line or to the range defined by the offset and length?

125

What is Break used for?

140–142

What is ReflownColumn used for?

163–168

Does the current line refer to the line at LineIndex?

lib/Format/ContinuationIndenter.cpp
1505

Does a reflow count as a break?

unittests/Format/FormatTest.cpp
7735

TODO?

10603

What happened here?

unittests/Format/FormatTestComments.cpp
1104

I think this is desirable. The jsdoc folks really wanted to fix-up the */ at the end. I think it has to do with DelimitersOnNewline.
If this already works in Java and js mode, that could be good enough.

2108

I thinks that we should be compressing whitespace in reflow mode too.

2149

This is like this for cases like lists in comments:

blah-blah-blah:
  1. blah
  2. blah-blah

I think here the block comments behavior might be wrong.

klimek updated this revision to Diff 124075.Nov 23 2017, 6:32 AM
klimek marked 10 inline comments as done.

Address review comments.

klimek added inline comments.Nov 23 2017, 6:32 AM
lib/Format/BreakableToken.cpp
196

That is different (I now also went and updated the comment for getRangeLength to explain that).

Generally, Length == Line.size() - Offset is still a portion of the content, as opposed to npos, which has a special meaning. I'm wondering whether I should just pull out a differently named method for it, now that I'm thinking about it.

lib/Format/BreakableToken.h
100

Reworded.

118–136

Introduced 'text' in the first paragraph and reworded.

140–142

Argh, actually not called ReflownColumn in the implementations :)
Changed and commented.

163–168

Reworded.

lib/Format/ContinuationIndenter.cpp
1505

I do believe so (well, the break in the reflow counts, the reflow itself is not a break, but removing a break :)

unittests/Format/FormatTest.cpp
10603

It was wrong, it's now correct :)
Those characters are printed in double-width, and llvm's encoding library correctly returns this. Reflowing previously did not correctly count that, but used the character count instead.
At offset 8 (after a \t), C " pokes one column over the limit (C == double-width character, space == 1, " == 1 -> 4 -> 12 columns.

unittests/Format/FormatTestComments.cpp
1104

I don't know whether it does - I only know I didn't break any tests :)

2108

Correct, but in a different patch - this patch is a strict improvement regarding whitespace compression, and adding more would make it more complex.

2149

Note that on the doc I shared you voted the reverse ;)

krasimir added inline comments.Nov 23 2017, 7:39 AM
lib/Format/BreakableToken.cpp
196

Yes please!

lib/Format/ContinuationIndenter.cpp
1505

So, if we add a break then remove a break while reflowing, will BreakInserted be true or false?

unittests/Format/FormatTestComments.cpp
2149

Then I should consider re-voting :)

klimek updated this revision to Diff 124095.Nov 23 2017, 8:32 AM

Pull out getRemainingLength.

klimek marked an inline comment as done.Nov 23 2017, 8:34 AM
klimek added inline comments.
lib/Format/ContinuationIndenter.cpp
1505

True.

Started the review. It would take a few cycles 💃

lib/Format/ContinuationIndenter.cpp
1519

Could you please spell out the invariants that we maintain about TailOffset, RemainingTokenColumns and ContentStartColumn (at least) at the beginning of every main loop iteration below? That would surely make it easier to review.

1538

If the previous iteration requested that we try to reflow, how can ContentStartColumn be at the start of the (current) line?

klimek added inline comments.Nov 27 2017, 2:55 AM
lib/Format/ContinuationIndenter.cpp
1519

I'm not sure what you mean - I would literally write what I already wrote in the comment. Can you formulate questions so I can make sure I answer them?

1538

After a line break (that was already in the original text), ContentStartColumn is set to where we want the first character of the broken line to start.

klimek updated this revision to Diff 124381.Nov 27 2017, 7:56 AM

Restructure based on code review feedback.

Restructured to make the invariants clearer based on a chat with Krasimir.

krasimir added inline comments.Nov 28 2017, 2:53 AM
lib/Format/ContinuationIndenter.cpp
1525

If we indent here, shouldn't that also change ContentStartColumn?

1557

How does the last line's penalty get handled in addNextStateToQueue()?

1565–1643

I really dislike this: we shouldn't have the reflow control flow depend on the specific type of the token. Better introduce a new virtual method that enables this branch and override it appropriately.

1578

nit: BreakableToken::Split NextSplit = Token->getSplit(...)

1578

Hm, right now ContentStartColumn + ToSplitColumns points to the column where the character at offset TailOffset + Split.first is supposed to be. Then NextSplit is relative to the offset TailOffset + Split.first + Split.second, so IMO it shouldn't use ContentStartColumn + ToSplitColumns as a start column. I think that ToSplitColumns needs to be computed as follows:

unsigned ToSplitColumns = Token->getRangeLength(LineIndex, TailOffset, Split.first + Split.second, ContentStartColumn);

Also, a comment of the intended meaning of ToSplitColumns would be helpful.

1579

nit: comlumns

klimek updated this revision to Diff 124581.Nov 28 2017, 8:28 AM
klimek marked 3 inline comments as done.

Address review comments.

klimek added inline comments.Nov 28 2017, 8:30 AM
lib/Format/ContinuationIndenter.cpp
1525

Nope, this will exactly adapt the start of the line to ContentStartColumn. ContentStartColumn is where the line thinks it wants to be, not where it is.

1557

By the State's Column being above the column limit.

1578

Good catch, but the solution is differnet. Using ContentStartColumn + ToSplitColumns is incorrect, we need ContentStartColumn + ToSplitColumns + 1 (as Split.second just contains the whitespace, but we want to consider that whitespace compressed).

I tried to write a test for this, and convinced myself that while +1 is correct, it is currently impossible to change behavior based on the missing +1.

krasimir added inline comments.Nov 28 2017, 10:55 AM
lib/Format/BreakableToken.cpp
178

Offtopic: Should add a FIXME that this doesn't really work in the presence of tabs.

lib/Format/BreakableToken.h
154

nit: the comment doesn't make sense anymore.

lib/Format/ContinuationIndenter.cpp
1625

nit: no braces around single-statement if body

1707

When we're reflowing we replace the reflow split with a space, so IMO this should be:

RemainingTokenColumns = Token->getRemainingLength(
              NextLineIndex, TailOffset, ContentStartColumn + 1);
1709

IMO this should be ContentStartColumn + RemainingTokenColumns + 1, accounting for the space that replaces the reflow split.

1721

Looks like ContentStartColumn is consistently used as the start column of the reflown content on the next line.
I suggest to ++ContentStartColumn at the beginning of the body of this if statement (and adjust below appropriately).

Re: "I tried to write a test for this, and convinced myself that while +1 is correct, it is currently impossible to change behavior based on the missing +1.":
In order to have different outcome based on the start column, you could use tabs. Consider the content "aaa\tb" with 4 spaces of tabstop. This takes up 5 columns (say, under the column limit) if it starts at column 0, and 8 columns (say, over the column limit) if it starts at column 1.

klimek updated this revision to Diff 124709.Nov 29 2017, 3:09 AM
klimek marked 4 inline comments as done.

Address review comments.

klimek added inline comments.Nov 29 2017, 3:09 AM
lib/Format/ContinuationIndenter.cpp
1707

Actually, ContentStartColumn is just calculated incorrectly above. Added tests, and added the +1 above, which makes it +1 for all code below.

1721

Yep, that's what I also figured out - that also led to removing ++ContentStartColumn in the reflow case below, which was what made this work at all.
Added tests to ReflowsCommentsPrecise, which flow through all of the corner cases of the if's here.

krasimir added inline comments.Nov 29 2017, 3:27 AM
lib/Format/ContinuationIndenter.cpp
1750

nit: Maybe change this to if (Reflow) and switch the if-else bodies.

1778

Shouldn't we be resetting NewBreakBefore to false here?

klimek added inline comments.Nov 29 2017, 4:32 AM
lib/Format/ContinuationIndenter.cpp
1750

I had that first, but found that harder to follow when re-reading the code - if you feel strongly, I'm also happy to turn it around again :)

1778

NewBreakBefore is always reset at the start of the loop, so resetting it here wouldn't matter.

This revision is now accepted and ready to land.Nov 29 2017, 6:05 AM
This revision was automatically updated to reflect the committed changes.