This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Recognize escape sequences when breaking strings
AbandonedPublic

Authored by sstwcw on Jun 29 2023, 8:32 AM.

Details

Summary

When a string literal is too long to fit within the column limit, the
formatter breaks it into several lines. When breaking a string, the
formatter has always preferred to break it where there is already
whitespace in the string. That is, if the string can be made to fit
in the column limit by being broken at a whitespace character, then it
will not be broken at a location that does not have a whitespace
character. Take for example the string "some text". Even if the
column limit allows for 6 characters on the first line, it will get
broken into "some " and "text". This has always been the case.

However, the program did not consider escape sequences when looking
for whitespace in the string to break it at. Say the original string
has an escape sequence instead of a space like "some\ttext" or
"some\ntext". Before this patch, the program treated the \t or
\n as an ordinary character. It could break the string into
"some\nt" and "ext".

After this patch, the program will treat escape sequences for
whitespace characters as preferred locations to break the string at
just like ordinary unescaped whitespace characters. So the string
"some\ntext" will get broken into "some\n" and "text" when it is
too long to fit in one line, even if the column limit allows
"some\nt" to fit on the first line. And also \n is preferred over
other whitespace. That is, if a string that needs to be broken into
several lines has any newline \n escape sequence in it, and breaking
it at one of the newline escape sequences makes it fit in the column
limit, then it will get broken at one of the newlines.

This patch only affects strings that are too long. A string that
already fits in the column limit still will not get broken, even if it
has \n in it.

Diff Detail

Event Timeline

sstwcw created this revision.Jun 29 2023, 8:32 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptJun 29 2023, 8:32 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
sstwcw requested review of this revision.Jun 29 2023, 8:32 AM
sstwcw added inline comments.Jun 29 2023, 8:35 AM
clang/lib/Format/BreakableToken.cpp
225

Should I change the threshold from 2 to 1?

It is set to 2 because it is the old behavior. There is already a test.

already in the test:
"some"
" tex"
"t"

I expect:
"some"
" "
"text"

is there a git issue for this, I'm struggling to understand the problem we are solving.

sstwcw retitled this revision from [clang-format] Prefer breaking long strings at new lines to [clang-format] Recognize escape sequences when breaking strings.Jun 30 2023, 9:04 PM
sstwcw edited the summary of this revision. (Show Details)

See the updated summary. There does not seem to be an issue on GitHub when I searched. I simply noticed it when reading the code in order to understand how strings got broken.

clang/lib/Format/BreakableToken.cpp
223

What if the new line is after the limit?

225

Since it was explicit, I'd keep it that way.

sstwcw marked 2 inline comments as done.Jul 1 2023, 8:31 AM
sstwcw added inline comments.
clang/lib/Format/BreakableToken.cpp
223

Then the loop will exit on line 197. I added a test for it.

sstwcw marked an inline comment as done.Jul 1 2023, 8:32 AM
sstwcw updated this revision to Diff 536531.Jul 1 2023, 8:34 AM
  • Add tests for newline outside of column limit

What would happen in a \r\n example

clang/lib/Format/BreakableToken.cpp
207

Are you testing \r \v \f

sstwcw updated this revision to Diff 537190.Jul 4 2023, 4:35 PM
  • Add tests for other escape sequences
sstwcw marked an inline comment as done.Jul 4 2023, 4:38 PM

What would happen in a \r\n example

If the entire \r\n part fits on the first line, the break is after \r\n. If only the part up to \r fits on the first line, the break is between \r and \n.

clang/lib/Format/BreakableToken.cpp
207

I am now.

sstwcw marked an inline comment as done.Jul 4 2023, 4:52 PM

What would happen for say unicode escapes?

x = "XXXThis is a string \u0041 With stuff after";

would it break after the \u?

x = "XXXThis is a string \u
0041 With stuff after";
MyDeveloperDay added inline comments.Jul 5 2023, 4:20 AM
clang/lib/Format/BreakableToken.cpp
199

Can we add a unit test for escape sequences > \X which I assume this handles

sstwcw added a comment.Jul 5 2023, 7:53 AM

It won't break the string inside an escape sequence. But escape sequences like \040, \x20, and \u0020 don't get recognized as whitespace.

clang/lib/Format/BreakableToken.cpp
199

There are already tests that verify that the string does not get broken inside an escape sequence in BreaksWideAndNSStringLiterals and DoNotBreakStringLiteralsInEscapeSequence. And escape sequences longer than 1 character following the backslash should not begin with one of these letters. So I don't see what tests I should add.

MyDeveloperDay added inline comments.Jul 5 2023, 10:08 AM
clang/lib/Format/BreakableToken.cpp
199

I was thinking about your tests having a test case which shows it shouldn't break, in case someone add case 'x' into your code, i.e. asserting the negative.

sstwcw updated this revision to Diff 537701.Jul 6 2023, 7:01 AM
  • Add tests for non-whitespace sequences
sstwcw marked 2 inline comments as done.Jul 6 2023, 9:18 PM
sstwcw abandoned this revision.Aug 23 2023, 8:05 PM

I am closing it since no one seems to like it.