This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Break long string literals in C#, etc.
ClosedPublic

Authored by sstwcw on Jun 29 2023, 8:35 AM.

Details

Summary

Now strings that are too long for one line in C#, Java, JavaScript, and
Verilog get broken into several lines. C# and JavaScript interpolated
strings are not broken.

A new subclass BreakableStringLiteralUsingOperators is used to handle
the logic for adding plus signs and commas. The updateAfterBroken
method was added because now parentheses or braces may be required after
the parentheses or commas are added. In order to decide whether the
added plus sign should be unindented in the BreakableToken object, the
logic for it is taken out into a separate function
shouldUnindentNextOperator.

The logic for finding the continuation indentation when the option
AlignAfterOpenBracket is set to DontAlign is not implemented yet. So in
that case the new line may have the wrong indentation, and the parts may
have the wrong length if the string needs to be broken more than once
because finding where to break the string depends on where the string
starts.

The preambles for the C# and Java unit tests are changed to the newer
style in order to allow the 3-argument verifyFormat macro. Some cases
are changed from verifyFormat to verifyImcompleteFormat because those
use incomplete code and the new verifyFormat function checks that the
code is complete.

Diff Detail

Event Timeline

sstwcw created this revision.Jun 29 2023, 8:35 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptJun 29 2023, 8:35 AM
sstwcw requested review of this revision.Jun 29 2023, 8:35 AM
sstwcw retitled this revision from [clang-format] Break strings in Verilog to [clang-format] Break long strings in Verilog.Jun 30 2023, 9:07 PM

I'd really prefer you put what you need to modify mutable instead of removing the const from everything else. But that's just my opinion.

clang/include/clang/Format/Format.h
2012

It definitely works in C++, and I'd suspect C# and Java too.

clang/lib/Format/BreakableToken.h
313
sstwcw updated this revision to Diff 536555.Jul 1 2023, 3:28 PM
  • Address comments
sstwcw marked 2 inline comments as done.Jul 1 2023, 3:29 PM

I'd really prefer you put what you need to modify mutable instead of removing the const from everything else. But that's just my opinion.

Why should the insertBreak method be const?

sstwcw added inline comments.Jul 1 2023, 3:32 PM
clang/include/clang/Format/Format.h
2021–2033

It looks like they made a mistake in this example. The second line gets indented by a continuation indentation both before and after this patch.

I'd really prefer you put what you need to modify mutable instead of removing the const from everything else. But that's just my opinion.

Why should the insertBreak method be const?

Because it shouldn't and up until now doesn't modify the BreakableToken the modifying changes happen in the WhitespaceManager.

clang/include/clang/Format/Format.h
2048

I'd drop that, or replace it with a nicer sentence which only speaks of "all other languages", because right now you are missing at least protobuf and json.

sstwcw added a comment.Jul 6 2023, 9:11 PM

If it is expected that the method does not modify the field, and the expectation is not fulfilled, I prefer to make it clear rather than to pretend that the method works as expected. Can you help me come up with an alternative? Braces need to be added around the string literal when it gets broken into parts. Only one pair of braces need to be added no matter how many times the string gets broken. My solution was to use a mutable field to mark whether braces have already been added. Do you have another way?

sstwcw updated this revision to Diff 537975.Jul 6 2023, 9:17 PM
sstwcw marked an inline comment as done.
  • Drop the line in the doc
sstwcw edited the summary of this revision. (Show Details)Jul 14 2023, 6:50 PM
sstwcw updated this revision to Diff 540611.Jul 14 2023, 6:51 PM
sstwcw edited the summary of this revision. (Show Details)
  • Add back the const qualifiers for methods
clang/lib/Format/ContinuationIndenter.cpp
2294

This const can be readded, or not?

clang/lib/Format/WhitespaceManager.cpp
1522

Can you elaborate where the 2 replacements come from, and why we can't make sure there is no empty range?

sstwcw updated this revision to Diff 540752.Jul 15 2023, 7:43 PM
  • Add back the const qualifier
sstwcw marked an inline comment as done.Jul 15 2023, 7:46 PM
sstwcw added inline comments.
clang/lib/Format/ContinuationIndenter.cpp
2294

I forgot to add it.

sstwcw marked an inline comment as done.Jul 20 2023, 7:55 AM

Are there any more problems?

Apart from that I'm okay.

clang/lib/Format/WhitespaceManager.cpp
1522

This is open (at least for me).

sstwcw marked 2 inline comments as done.Jul 21 2023, 7:55 AM
sstwcw updated this revision to Diff 542916.Jul 21 2023, 7:55 AM
  • Add comment explaining the replacements

For the replacement part, I refer to @owenpan and/or @MyDeveloperDay hoping they have more insight.
The rest look good to me.

@owenpan What do you think about this revision especially the replacement part?

MyDeveloperDay accepted this revision.Jul 29 2023, 6:17 AM
This revision is now accepted and ready to land.Jul 29 2023, 6:17 AM
owenpan added inline comments.Jul 29 2023, 2:16 PM
clang/lib/Format/ContinuationIndenter.cpp
2192–2193

Can we support using + first?

@owenpan What do you think about this revision especially the replacement part?

See D154093#4544495. Then we can extend it by using , instead of + for Verilog (plus inserting a pair of braces).

sstwcw updated this revision to Diff 547619.Aug 6 2023, 5:47 PM

Add support for breaking strings with + in C# and others

sstwcw retitled this revision from [clang-format] Break long strings in Verilog to [clang-format] Break long string literals in C#, etc..Aug 6 2023, 5:48 PM
sstwcw edited the summary of this revision. (Show Details)

@owenpan What do you think about this revision especially the replacement part?

See D154093#4544495. Then we can extend it by using , instead of + for Verilog (plus inserting a pair of braces).

I added support for that. See the updated description. It turned out
that because inserting plus signs may involve inserting parentheses, all
the replacement stuff are still needed. So I didn't split the Verilog
stuff into a separate patch.

sstwcw marked an inline comment as done.Aug 6 2023, 5:49 PM
sstwcw added a comment.Aug 6 2023, 5:53 PM

@MyDeveloperDay Can you take a look at the last test in FormatTestCSharp.Attributes? I tried to enable breaking string literals by default. The long string got broken.

MyDeveloperDay added inline comments.Aug 7 2023, 1:05 AM
clang/unittests/Format/FormatTestCSharp.cpp
221

not sure what changing here and why?

clang/unittests/Format/FormatTestJS.cpp
1575

out of interest what happens with <backtick>xxxxxx ${variable} xxxxxx<backtick> type strings? can we add a test to show we won't break in the middle of a ${vas} even though the column limit might be reached?

sstwcw updated this revision to Diff 547767.Aug 7 2023, 6:49 AM
sstwcw marked 2 inline comments as done.
  • Add tests for code in JavaScript template strings
clang/unittests/Format/FormatTestCSharp.cpp
221

The brace is not paired. The old verifyFormat function didn't care about it. In FormatTestBase.h there is the function verifyFormat which requires the code to be complete and verifyIncompleteFormat which doesn't.

clang/unittests/Format/FormatTestJS.cpp
1575

There can be a break inside the braces. I added a test.

sstwcw updated this revision to Diff 547770.Aug 7 2023, 6:52 AM
  • Add tests for code in JavaScript template strings
MyDeveloperDay accepted this revision.Aug 8 2023, 12:28 AM

I'm ok with this if @owenpan is

This revision was landed with ongoing or failed builds.Aug 23 2023, 8:24 PM
This revision was automatically updated to reflect the committed changes.

@sstwcw this patch caused a crash in FormatTestJS.StringLiteralConcatenation:

Assertion failed: ((!RequiresNullTerminator || BufEnd[0] == 0) && "Buffer is not null terminated!"), function init, file MemoryBuffer.cpp, line 54.

Can you revert it if you can reproduce the crash?

This was causing failures on a large number of Linaro's bots. Judging by how widespread it was, reproducing could be as simple as adding -DLLVM_ENABLE_ASSERTIONS=True to your CMake config.

In the JavaScript tests that I added, it was wrong to use SmallString. Would you prefer me changing it to string or expanding the 6 test cases so we don't need a variable for the string?

Shortly after I committed this patch, the server builds caught the problem in the JavaScript tests. When the program reads a file, a null byte is added at the end. The program relies on this property because it means it doesn't have to do a bound check on the source string, including in some of the code I wrote though this time it was another piece of code that caught the problem. Also the null byte is not copied when a SmallString is initialized from a const char*. A null byte will be added if the c_str method is used at least once. But it was not the case here. So my test ended up passing a string which does not have a null byte at the end. I didn't catch the problem when I ran the tests myself because ironically I had the address sanitizer enabled. The address sanitizer didn't catch the problem because the byte past the end of the string in the SmallString is the struct padding which is still inside the object as far as the address sanitizer is concerned. However the address sanitizer likes to add padding between variables. So now the uninitialized part of the struct padding is the padding null bytes added by the address sanitizer instead of the variable that used to be there. So the assertion that the byte is null now holds.

In the JavaScript tests that I added, it was wrong to use SmallString. Would you prefer me changing it to string or expanding the 6 test cases so we don't need a variable for the string?

You mean std::string? That would be fine by me.

sstwcw reopened this revision.Aug 26 2023, 7:02 AM
This revision is now accepted and ready to land.Aug 26 2023, 7:02 AM
sstwcw updated this revision to Diff 553729.Aug 26 2023, 7:04 AM

Stop using SmallString in the tests.

This revision was automatically updated to reflect the committed changes.
eaeltsin added a subscriber: eaeltsin.EditedSep 12 2023, 11:47 AM

This introduces an invalid TS transformation, from

type x = 'ab';

to

type x = ('a'+'b');

Which doesn't compile - https://www.typescriptlang.org/play?#code/C4TwDgpgBAHlC8UDkBDARkg3AKG6SUICUAFKklANTIYCUOQA

Should we revert?

alexfh added a subscriber: alexfh.Sep 12 2023, 3:34 PM

This patch makes clang-format produce invalid JS/TS code

This introduces an invalid TS transformation, from

type x = 'ab';

to

type x = ('a'+'b');

Which doesn't compile - https://www.typescriptlang.org/play?#code/C4TwDgpgBAHlC8UDkBDARkg3AKG6SUICUAFKklANTIYCUOQA

Should we revert?

Similarly, strings in dictionary literals can't be broken: var w = {'a' + 'b': 0}; (https://gcc.godbolt.org/z/6renEv5sa). Please fix or revert.

This patch makes clang-format produce invalid JS/TS code

This introduces an invalid TS transformation, from

type x = 'ab';

to

type x = ('a'+'b');

Which doesn't compile - https://www.typescriptlang.org/play?#code/C4TwDgpgBAHlC8UDkBDARkg3AKG6SUICUAFKklANTIYCUOQA

Should we revert?

Similarly, strings in dictionary literals can't be broken: var w = {'a' + 'b': 0}; (https://gcc.godbolt.org/z/6renEv5sa). Please fix or revert.

Here is the fix. https://github.com/llvm/llvm-project/pull/66168

This still misses contexts when string literal is required, for example https://github.com/search?q=repo%3Agoogle%2Fclosure-compiler%20%22%20must%20be%20a%20string%20literal%22&type=code

I wonder, if splitting the literal with + is a good option at all.

uabelho removed a subscriber: uabelho.Sep 14 2023, 6:57 AM

This might have another issue with Verilog -

<                 import "AAA-BBB" foo bar baz
---
>                 import {"AAA-",
>                         "BBB"} foo bar baz

I wonder if Verilog allows breaking strings in import?

This might have another issue with Verilog -

<                 import "AAA-BBB" foo bar baz
---
>                 import {"AAA-",
>                         "BBB"} foo bar baz

I wonder if Verilog allows breaking strings in import?

From the spec:
dpi_spec_string ::= "DPI-C" | "DPI"
Seems like it can't.

This might have another issue with Verilog -

<                 import "AAA-BBB" foo bar baz
---
>                 import {"AAA-",
>                         "BBB"} foo bar baz

I wonder if Verilog allows breaking strings in import?

From the spec:
dpi_spec_string ::= "DPI-C" | "DPI"
Seems like it can't.

Sent out https://github.com/llvm/llvm-project/pull/66951