This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Format raw string literals
ClosedPublic

Authored by krasimir on Jul 27 2017, 7:36 AM.

Details

Summary

This patch adds raw string literal formatting.

It's still work in progress, however feedback is welcome.

Diff Detail

Repository
rL LLVM

Event Timeline

krasimir created this revision.Jul 27 2017, 7:36 AM
klimek added inline comments.Jul 27 2017, 8:29 AM
lib/Format/ContinuationIndenter.cpp
89 ↗(On Diff #108460)

Why 19?

1259–1262 ↗(On Diff #108460)

Why 3? Generally, when you use magic values other than 0, -1, 1, comment on what exactly they mean.

lib/Format/ContinuationIndenter.h
41 ↗(On Diff #108460)

Why getGoogleStyle?

krasimir updated this revision to Diff 108485.Jul 27 2017, 9:11 AM
  • Fixed formatting
krasimir updated this revision to Diff 108487.Jul 27 2017, 9:20 AM
krasimir marked 2 inline comments as done.
  • Add comments on magic
krasimir retitled this revision from Add string delimiters and compute raw string styles to [clang-format] Format raw string literals.Jul 27 2017, 9:23 AM
krasimir edited the summary of this revision. (Show Details)
krasimir added reviewers: djasper, klimek.
krasimir updated this revision to Diff 108493.Jul 27 2017, 9:45 AM
  • Add tests for delimiters and make delimiter search case-insensitive
krasimir updated this revision to Diff 108631.Jul 28 2017, 5:16 AM
  • Add tests for raw strings in arguments
krasimir updated this revision to Diff 108635.Jul 28 2017, 5:52 AM
  • Make reformatRawStringLiteral return the new state column
  • Add FormattedRawString cache
krasimir updated this revision to Diff 108640.Jul 28 2017, 6:29 AM
  • Add BasedOnStyle to RawStringFormat
krasimir added inline comments.Jul 28 2017, 6:30 AM
lib/Format/ContinuationIndenter.h
41 ↗(On Diff #108460)

Added BasedOnStyle to the RawStringStyle type, now it's not arbitrary anymore. Thank you!

krasimir updated this revision to Diff 108642.Jul 28 2017, 6:42 AM
  • Add tests for case-sensitive delimiter matching
krasimir updated this revision to Diff 108654.Jul 28 2017, 7:41 AM
  • Add tests for raw strings starting on own line
krasimir updated this revision to Diff 108892.Jul 31 2017, 3:24 AM
  • Add tests for breaks before raw strings
krasimir updated this revision to Diff 108903.Jul 31 2017, 5:17 AM
  • Add tests for raw string operands
krasimir updated this revision to Diff 108904.Jul 31 2017, 5:22 AM
  • Add more tests for raw string operands
krasimir updated this revision to Diff 108936.Jul 31 2017, 9:03 AM
  • Add LastStartColumn and prefix and postfix raw string alignment
krasimir updated this revision to Diff 109316.Aug 2 2017, 3:52 AM
  • Adjust the indent of raw string suffixes
krasimir updated this revision to Diff 109358.Aug 2 2017, 8:21 AM
  • Add a test demonstrating exceeding the column limit
krasimir updated this revision to Diff 109529.Aug 3 2017, 5:56 AM
  • Propagate estimated raw string penalty to the reflow optimization
krasimir updated this revision to Diff 109554.Aug 3 2017, 7:21 AM
  • Rebase with master

I think this is ready for a round of reviews.

klimek added inline comments.Aug 4 2017, 2:25 AM
lib/Format/ContinuationIndenter.cpp
83 ↗(On Diff #109554)

If it == 5, it''s also trivially empty. It can't really be 6 (as it needs to be repeated), so only 5 + 2n are valid sizes? So < 7 would work here?

95 ↗(On Diff #109554)

After tokenization, can't we assume that's true?

138 ↗(On Diff #109554)

Why do we need a new parameter to pass through instead of using FirstIndent?

1011 ↗(On Diff #109554)

Add a comment why this happens (-> reading this, I assume the State.Column is adapted in the call to reformatRawStringLiteral).

1015 ↗(On Diff #109554)

I'd have expected the raw string formatting to happen here, as breakProtrudingToken is very similar (semantically) to formatting raw strings.

1286–1289 ↗(On Diff #109554)

Why do we need to estimate the penalty, as opposed to getting it from the recursive formatting step?

1325–1326 ↗(On Diff #109554)

Why do we need an extra cache and can't let the State take care of this?

lib/Format/ContinuationIndenter.h
167 ↗(On Diff #109554)

"A key in for" doesn't parse.

krasimir marked 4 inline comments as done.Aug 4 2017, 5:05 AM
krasimir added inline comments.
lib/Format/ContinuationIndenter.cpp
83 ↗(On Diff #109554)

The delimiter could be the empty string. Added a comment for that.

95 ↗(On Diff #109554)

No. The tokenizer is pretty liberal and will recognize R"pb(text)" as a string literal, for example (this is because of some handling of the strings occurring as header names). Added a test for that.

krasimir added inline comments.Aug 4 2017, 5:05 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

There are two parameters in State that are initialized: State.FirstIndent, which is always initialized as FirstIndent and Stare.Column, which is initialized appropriately.

The raw string formatting introduces the ability to format a piece of code as if it's first line started at FirstStartColumn, and this new parameter needs to be propagated appropriately where it matters.

1015 ↗(On Diff #109554)

I explored this approach initially. The problem is that breakProtrudingToken internally assumes that the token has been put on the state (the protruding part) and reformatting conceptually takes place before that.

1286–1289 ↗(On Diff #109554)

That would be the other option. I went with the simplest approach.
Here's some problems with getting the penalty from the recursive step:

  1. This breaks the abstraction of reformat. Conceptually, the penalty stuff is an implementation detail of the Optimizing Formatter. For other implementations of Formatter-s, we don't have a meaningful penalty.
  2. The penalty is computed in the context of a different FormatStyle, so it might be incompatible with the way the penalties are computed in the current context.
1325–1326 ↗(On Diff #109554)

How can the State take care of this? We want to save the edits to the raw string text so that we don't call reformat again in non-dry-run-mode. Conceptually, there might be different State-s having the same FormattedRawStringCacheKey, where the edits could be reused.

krasimir updated this revision to Diff 109712.Aug 4 2017, 5:05 AM
krasimir marked 3 inline comments as done.
  • Address review comments
krasimir updated this revision to Diff 109713.Aug 4 2017, 5:07 AM
krasimir marked an inline comment as done.
  • Update comments
klimek added inline comments.Aug 4 2017, 6:26 AM
lib/Format/ContinuationIndenter.cpp
95 ↗(On Diff #109554)

I mean, that's a string literal, just not a raw string literal :) I'm wondering why the tokanizer doesn't have to make that distinction. Anyway, if it can happen, of course this looks fine.

138 ↗(On Diff #109554)

I thought that's what FirstIndent means? I'm still really confused :)

1015 ↗(On Diff #109554)

Why? I guess my question is: Where do the side effects of breakProtrudingToken and reformatRawStringLiteral differ?

1286–1289 ↗(On Diff #109554)

(2) is an interesting argument. I don't see why (1) holds, though, as we're already doing recursive formatting for child blocks within statements. I'd like to get Daniel's opinion on this point.

1325–1326 ↗(On Diff #109554)

What's the performance difference there?

djasper added inline comments.Aug 4 2017, 6:34 AM
lib/Format/ContinuationIndenter.cpp
1286–1289 ↗(On Diff #109554)

Ok, as for the function itself: It needs a better description of what the estimation does and why.

For the points:

  1. The optimizing formatter can compute the penalty. If the NoLineBreakFormatter is used, then the penalty should be 0 (there should be no line break or column limit violation). The NoColumnLimitFormatter doesn't care about penalties at all, so we can also assume 0.
  1. In what way can it be incompatible?
krasimir marked an inline comment as not done.Aug 4 2017, 8:57 AM
krasimir added inline comments.
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

FirstIndent is the expected indent of the first token of the unwrapped line, which gets propagated, for example, in line breaks. The formatter itself controls it.
FirstStartColumn is where this line starts. The formatter doesn't control it.

Consider:

void f() {
  auto s = R"pb(key1: value1,
      key2: value2,
      key3: value3)pb";
}

To reformat the raw string literal, we need to take into account both the column where key1 starts (that is FirstStartColumn is 16) and the indent we'd like the first unwrapped line to have (that is FirstIndent is 6).

Maybe we need to rename these and add comments about them?

1015 ↗(On Diff #109554)

You're right. This belongs to breakProtrudingToken. I'll move this to an implementation of BreakableToken.

1286–1289 ↗(On Diff #109554)

I have another point about switching to use the recursive formatting of the penalties:

  1. The penalty for the last line of a multiline token is handled at a later stage. If we use the penalty recursively, that will include the penalty for the last line, so we'll need to somehow propagate to the later stage that we don't wanna handle the penalty for the last line there.

As an example for (2) penalty incompatibility, you could use a style inside the raw strings that has different PenaltyExcessCharacter, for example.

Also, consider another possible use-case of formatting SQL raw string literals, for example. clang-format doesn't understand how to format these and could possibly only touch the leading indent of the lines. In that case, the penalty estimation would work.

1325–1326 ↗(On Diff #109554)

Without the cache, the edits to each raw string token are computed at least twice: once in DryRun mode, once in wet mode.
So, I can come up with an example where this cache makes formatting twice faster.

Switching the implementation to a BreakableToken would eliminate the need of the cache, since the edits will be stored in the BreakableToken itself.

krasimir updated this revision to Diff 109878.Aug 5 2017, 9:14 AM
  • Move reformatRawStringLiteral to breakProtrudingToken
krasimir added inline comments.Aug 5 2017, 9:15 AM
lib/Format/ContinuationIndenter.cpp
1325–1326 ↗(On Diff #109554)

I was wrong. I moved the implementation to breakProtrudingToken, but that doesn't eliminate the need for the cache.

krasimir updated this revision to Diff 109884.Aug 5 2017, 11:17 AM
  • Document estimatedPenalty
krasimir updated this revision to Diff 109885.Aug 5 2017, 11:24 AM
krasimir marked 2 inline comments as done.
  • Fix formatting around a newly added block
krasimir added inline comments.Aug 5 2017, 11:30 AM
lib/Format/ContinuationIndenter.cpp
1309 ↗(On Diff #109885)

@djasper, @klimek: what's your opinion about this decision? It seems logical, but most of the examples I see in the wild are formatted more like State.FirstIndent + Style.IndentWidth.

1286–1289 ↗(On Diff #109554)

Added a description for estimatedPenalty.

klimek added inline comments.Aug 7 2017, 1:35 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

The same thing is true for blocks inside expressions, though - what's the difference?

1015 ↗(On Diff #109554)

Wait wait :) I'm not arguing it needs to be using the same implementation. I'm just saying it should be at the same spot in the code.

1325–1326 ↗(On Diff #109554)

What's the performance difference without the cache?

krasimir added inline comments.Aug 7 2017, 2:52 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

I don't know how blocks inside expressions are handled. Could you please give me an example or point me to that code?

1325–1326 ↗(On Diff #109554)

Without the cache, the edits to each raw string token are computed at least twice: once in DryRun mode, once in wet mode.
So, I can come up with an example where this cache makes formatting twice faster.

klimek added inline comments.Aug 7 2017, 3:09 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)
void f() {
  otherfunc([] {
    innerStatement();
    auto letsgo = [] {
      deeper();
    };
  });
}
krasimir added inline comments.Aug 7 2017, 3:46 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

Thank you! Here's how formatting of blocks inside expressions works now:
Consider this formatted code (with column limit 20):

a = [] {
  bbbb = 1;
};

This is parsed as two unwrapped lines, one like a = [] {}; and a child line attached to the { token like bbbb = 1;. Since these are two independent lines, they have two different values of FirstIndent: 0 and 2, respectively. Also, since these are on two separate unwrapped lines, the formatting of the { is independent of the formatting inside (except for the first indent that gets propagated through).

The situation with raw string literals is a bit different. Consider the following formatted code (with column limit 20):

auto s = R"pb(key1: value1,
    key2: value2,
    key3: value3)pb";

This is parsed as a single unwrapped line, like auto s = R"pb(...)pb"; with a single value of FirstIndent: 0. We need to somehow propagate the information about the additional indent that a line break at line level 0 inside the raw string literal would get, and that's how we end up with an additional parameter. From the point of view of the surrounding formatting, the formatting of the raw string literal is a black box returning replacements (and possibly additional penalty). So the surrounding formatting code needs to inform the raw string formatting code about the context in which the formatting occurs, which is achieved by passing the newly introduced FirstStartColumn, NextStartColumn and LastStartColumn parameters.

klimek added inline comments.Aug 7 2017, 4:43 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

I'm wondering whether the inner indent should come from the style of the inner formatter. I'm not sure yet. I think if it shouldn't, that's a good argument for passing it through (and we need to document that decision).

1286–1289 ↗(On Diff #109554)

As long as the penalty is applied consistently for every different possibility, and just adds up, won't it still be correct (in selecting the best solution), even if we have totally different excess char penalties? Can you give an example where that's not true?

1325–1326 ↗(On Diff #109554)

I understand that without the cache it's going to be evaluated twice. We do that in lots of cases, though. Which is why I'm curious how much difference in runtime this actually makes (we usually dry-run a bazillion of states and then apply only one, so the dry- vs. non-dry-run runtime usually doesn't matter).

krasimir added inline comments.Aug 7 2017, 6:29 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

The inner indent already comes from the style of the inner formatter. The offset where the inner column 0 starts, however, is controlled by the outer formatter.
Consider this code:

string s = R"pb(item: 1,
    msg_field {
      key: value
    })pb";

void f() {
  string s = R"pb(item: 1,
      msg_field {
        key: value
      })pb";
}

The 4 column relative indent of msg_field is the ContinuationIndentWidth of the outer format style.
The additional 2 column relative indent of key comes from the IndentWidth of the inner format style.
The outer formatter sees this code as:

string s = R"pb(<...stuff that the inner formatter will take care of>
####<...stuff that the inner formatter will take care of>
####<...stuff that the inner formatter will take care of>
####<...stuff that the inner formatter will take care of>)pb";

void f() {
  string s = R"pb(<...stuff that the inner formatter will take care of>
######<...stuff that the inner formatter will take care of>
######<...stuff that the inner formatter will take care of>
######<...stuff that the inner formatter will take care of>
######})pb";
}

where # signifies the offset where column 0 starts in the raw strings.
I'll refine the information about this new option with a set of rules that roughly determine the indentation decisions taken.

1286–1289 ↗(On Diff #109554)

Good point. Thank you!
I can't come up with an example.

1325–1326 ↗(On Diff #109554)

Do you have any ideas on how to gather some meaningful measurements on the runtime impact of this patch? I'm thinking of running it over a bunch of real-world source code that does proto processing and noting the runtime.

klimek added inline comments.Aug 7 2017, 6:41 AM
lib/Format/ContinuationIndenter.cpp
1325–1326 ↗(On Diff #109554)

Yep, that's exactly what I'd expect. Also try to find some "worst case" scenarios, where you think you have an example that would benefit the most from the caching and measure that - that often gives a good upper bound.

krasimir added inline comments.Aug 9 2017, 2:30 AM
lib/Format/ContinuationIndenter.cpp
1325–1326 ↗(On Diff #109554)

The cache adds ~4% performance improvement on a bunch of real-world examples.

krasimir updated this revision to Diff 110342.Aug 9 2017, 3:25 AM
  • Get the penalty from the recursive invocation
krasimir updated this revision to Diff 110343.Aug 9 2017, 3:42 AM
  • Update the section in moveStateToNextToken to match the original
krasimir updated this revision to Diff 112901.Aug 28 2017, 8:36 AM
  • Updated indent to IndentWidth
  • Revised the rules for formatting the prefix and suffix
krasimir marked 3 inline comments as done.Aug 29 2017, 2:52 AM

I think this is ready for another round of reviews.

klimek added inline comments.Sep 5 2017, 12:29 AM
lib/Format/ContinuationIndenter.cpp
138 ↗(On Diff #109554)

Ok, so the problem is that if we want to start an inner formatter, FirstStartColumn will be the column at which we currently are at the state of the parent formatter?
(If that's the case, please document it :)

1325–1326 ↗(On Diff #109554)

I'm not sure whether that's worth it, but I'm happy to let Daniel tip the scale.

krasimir added inline comments.Sep 5 2017, 6:56 AM
lib/Format/ContinuationIndenter.cpp
1325–1326 ↗(On Diff #109554)

I believe that the value of the cache is that it shows explicitly and mechanically the exact state that is propagated to the inner formatter. As such, I would argue that the cache increases maintainability.

krasimir updated this revision to Diff 113853.Sep 5 2017, 7:29 AM
  • Rebase with master
djasper added inline comments.Sep 5 2017, 7:45 AM
lib/Format/ContinuationIndenter.cpp
1325–1326 ↗(On Diff #109554)

I agree that the performance gain aren't worth the additional effort here. We would need a cache to prevent computing the same breaks in various paths of the combinatorial formatting, but I haven't studied the code enough to know whether we have this separately yet.

On maintainability:
We similarly don't do this in other places (breakProtrudingToken, formatChildren, etc.) and so I think doing this differently for raw string literals will make things less maintainable. Moreover, I see your point that caching makes it in some way easier to see that we are doing the exact same thing in the second phase. At the same time ensuring that caching is correct might be hard to maintain. E.g. finding bugs that are caused by us storing incorrect information in the cache or something might be extraordinarily hard to track down.

krasimir updated this revision to Diff 114552.Sep 11 2017, 2:35 AM
  • Fix a regression after preprocessor indentation
krasimir updated this revision to Diff 114553.Sep 11 2017, 2:42 AM
  • Remove raw string cache
krasimir marked 4 inline comments as done.Sep 11 2017, 2:46 AM
krasimir updated this revision to Diff 114555.Sep 11 2017, 2:50 AM
  • Document FirstStateColumn @ getInitialState
krasimir marked an inline comment as done.Sep 11 2017, 2:50 AM

I think this is ready for another round of reviews.

klimek accepted this revision.Oct 26 2017, 10:01 AM

Needs a bit more comments, otherwise LG.

include/clang/Format/Format.h
1730–1734 ↗(On Diff #114555)

Needs a comment. Also, this should probably go into an internal namespace somewhere below all the public interfaces?

lib/Format/ContinuationIndenter.cpp
1324 ↗(On Diff #114555)

No else after return.

lib/Format/TokenAnalyzer.h
75–79 ↗(On Diff #114555)

I think you'll want to document these.

This revision is now accepted and ready to land.Oct 26 2017, 10:01 AM
krasimir updated this revision to Diff 120481.Oct 26 2017, 1:55 PM
krasimir marked 2 inline comments as done.
  • Address review comments
krasimir marked an inline comment as done.Oct 26 2017, 1:55 PM
krasimir added inline comments.
include/clang/Format/Format.h
1730–1734 ↗(On Diff #114555)

Good point. Moved to a new internal header FormatInternal.h and added documentation.

I'll need a bit of refactoring on the test side of things, since old gcc doesn't like raw string literals in macros.

klimek added inline comments.Oct 27 2017, 2:00 AM
lib/Format/FormatInternal.h
27–28 ↗(On Diff #120481)

"and at which column the fragment's last line should end".

I don't understand the "if it is a newline" part.

29–30 ↗(On Diff #120481)

"specifies the column at which the first line of \p Code starts".

33–34 ↗(On Diff #120481)

"the column at which the last line of \p Code should end".

Also, s/it/if/. Does "just a newline" mean "is empty"?

krasimir updated this revision to Diff 120635.Oct 27 2017, 10:11 AM
krasimir marked 3 inline comments as done.
  • Improve comments
krasimir updated this revision to Diff 120637.Oct 27 2017, 10:19 AM
  • Refactor tests to use a function instead of a macro
krasimir updated this revision to Diff 120638.Oct 27 2017, 10:21 AM
  • Update example
This revision was automatically updated to reflect the committed changes.