This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] [PR44542,38872] String << String always get a forced newline.
AbandonedPublic

Authored by MyDeveloperDay on Jun 1 2020, 1:41 PM.

Details

Summary

https://bugs.llvm.org/show_bug.cgi?id=44542
https://bugs.llvm.org/show_bug.cgi?id=38872
(I'm sure I've seen others)

Users really don't like clang-format going off on one and breaking lines without their control, this piece of code which I'm removing dates all the way back to 2013, I think it was added because it people think

os << "from"
   << "asdf"
   << "def"
   << "ghi"
   << "ghi";

is more pleasing.. however, when you start using other types, then this just becomes crazy... and leaves the user wondering why. (see bugs for other odd cases)

os << "from"
   << "def" << 1 << "ghi"
   << "lmn"
   << "opq" << endl
   << "abc"
   << "ghi";

The rules I'm removing was to FORCE (MustBreak) a line break between "<String>" << "<String>" but not between "<String>" << AnyOtherType or AnyOtherType << "<String>"

This might be considered too much for a change, somehow breaking compatibility but I feel it's wrong and really plays to the contempt people have towards clang-format that it goes off and formats things how it wants to without obeying its own line limit rules.

This may need a configuration switch as it's likely to cause clang-format changes in already formatted code (as here with UnwrappedLineParser) (Polly also breaks)

...I also noticed no unit tests broke when I made this change...

Diff Detail

Event Timeline

MyDeveloperDay created this revision.Jun 1 2020, 1:41 PM

The change seems to me technically sound, but I'm not sure of the scope of its effects. There might be users that rely on this behaviour. On the other hand, adding an option to keep the old behaviour doesn't seem appropriate, and personally I consider the old behaviour as a bug.

The change seems to me technically sound, but I'm not sure of the scope of its effects. There might be users that rely on this behavior. On the other hand, adding an option to keep the old behavior doesn't seem appropriate, and personally I consider the old behavior as a bug.

I also have concerns about my own fix. This will break existing code similar to that found in UnwrappedLineParser.cpp that I included here, whilst I also think this is a bug, I fear users will complain that we broke compatibility. The fact it lacked tests was also a shock given how vigorously that is defended when people add new functionality.

One thing that perhaps we could consider is keeping the old odd behavior but behind a new configuration option, declaring it an outright bug and removing it might just be a more subjective than an objective opinion (if I think about the longevity of this 7+year and remove my myopic view that its a bug I can see that some code could be made less readable by this change)

What do people think about if we added

enum ChevronBreakingStyle
{
    Never,  // at line limit only
    LegacyBetweenStrings, // (current)
    AfterEndOrNewLine,   // we could break on endl/std::endl/ends/std::ends/"\n" or any string literal ending with \n
    Always  // like between strings but between all  << types   and  << "string"
}

Even if the default was LegacyBetweenStrings, I think those that are concerned about how it currently works would likely be happier to make the change to get the behavior they wanted, that way we wouldn't even break anywone.

I would have nothing against if you'd added this option and we kept current behaviour by default.
The only drawback is the (bit of) complexity it adds bit that seems justified to me.

MyDeveloperDay planned changes to this revision.Jun 3 2020, 6:18 AM
sammccall added a comment.EditedJun 3 2020, 8:10 AM

(Oops, hit enter too soon here... will edit the rest in)

Lack of tests is bad :-( This is http://github.com/llvm/llvm-project/commit/2603ee0dc6003 which dates back to clang-format early days, before there was good test coverage.

Obviously we can add tests for the existing behaviour, not sure whether it changes the analysis though.
FWIW at google we release clang-format from head weekly and regression test it by looking for changes in formatting of sampled files from a large codebase.
Certainly this isn't as good as unit-test coverage, though it has caught many bugs/style regressions in the past.

https://bugs.llvm.org/show_bug.cgi?id=44542

This irregularity is indeed annoying. Probably a lot of the time you want to break after D... unless it's Indent or so.

https://bugs.llvm.org/show_bug.cgi?id=44542

This example is not good evidence of anything IMO, it's just another not-really-function-like macro, with no evidence that this pattern is widely used.

IMO this *is* actually a good heuristic: << between string literals indicates some intent to break, and when the stream is entirely literals the formatting is good.
(@krasimir threw this into our regression testing, and where the formatting was different, it was always worse than before).

e.g. before:

LOG(ERROR) << "(" << foo << ")"
            << " The printer " << bar << " is on fire.";

after:

LOG(ERROR) << "(" << foo << ")" << " The printer " << bar
           << " is on fire.";

or before:

LOG_IF(DFATAL, exp_await_calls < num_await_calls_)
    << "Equality condition can never be satisfied:"
    << " exp_await_calls=" << exp_await_calls
    << " num_await_calls_=" << num_await_calls_;

after:

LOG_IF(DFATAL, exp_await_calls < num_await_calls_)
    << "Equality condition can never be satisfied:" << " exp_await_calls="
    << exp_await_calls << " num_await_calls_=" << num_await_calls_;

However the fact that we can't always infer the intent to break leads to some annoying regularities, and it's hard to imagine heuristics could catch every case. (I'm thinking about looking at leading/trailing characters in string literals, but that's going to be really confusing when it goes wrong).

I guess it's not the end of the world to lose this behavior, but would certainly prefer to keep it at least for google style.

(Incidentally, the fact that the google style guide leans towards using iostreams mostly only for logging-type stuff might be relevant to the types of examples that are typical)

IMO this *is* actually a good heuristic: << between string literals indicates some intent to break, and when the stream is entirely literals the formatting is good.

I take your point, how do you feel about an option even if the default was "BreakBetweenStrings"?

@sammccall in the example

LOG_IF(DFATAL, exp_await_calls < num_await_calls_)
    << "Equality condition can never be satisfied:"
    << " exp_await_calls=" << exp_await_calls
    << " num_await_calls_=" << num_await_calls_;

I agree this is more pleasing, however Its ONLY broken this was because of your LineLimit and the variable lengths

The same function with smaller variables isn't quite so readable

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << " bar=" << b;

and there is inconsistency between

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << " is less than "
                      << " bar=" << b;
std::string reason = " is less than ";
LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << reason << " bar=" << b;

I could see value in allowing (AlwaysBreak) (as this really help to show the need for a space before the "foo=,is less and bar=")

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo="
                      << a
                      << " is less than "
                      << " bar="
                      << b;

and also not enforcing any breaking other than line length (Never)

std::string reason = " is less than ";
LOG_IF(DFATAL, a < b) << "=" << " foo=" << a << reason << " bar=" << b;

because that actually looks a little better IMHO than

std::string reason = " is less than ";
LOG_IF(DFATAL, a < b) << "="
                      << " foo=" << a << reason << " bar=" << b;

But whatever we do I think we should leave the current one as the default.

Just to add, I also think although because there are no tests to enforce it the primary reason is for when we see '\n', breaking the stream on endl but this is actually covered by another rule this also in mustBreak()

if (Current.is(tok::lessless) &&
    ((Previous.is(tok::identifier) && Previous.TokenText == "endl") ||
     (Previous.Tok.isLiteral() && (Previous.TokenText.endswith("\\n\"") ||
                                   Previous.TokenText == "\'\\n\'"))))
  return true;

Is going to break on what is most likely a natural heuristics where \n separates different lines

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << endl
                      << " is less than "
                      << " bar=" << b;

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << "\n"
                      << " is less than "
                      << " bar=" << b;

but this is also flawed depending on what precedes the newline

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << " and " << endl
                      << " is less than "
                      << " bar=" << b;

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << " and "
                      << "\n"
                      << " is less than "
                      << " bar=" << b;

I guess these rules were a good first compromise.. and more likely you'd want different rules depending on how large the streaming operations.

The same function with smaller variables isn't quite so readable

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << " bar=" << b;

FWIW, this looks fine to me, it's the split between = and variable I was objecting to motsly.
We could still get unlucky depending on line lengths:

LOG << "blah"
  << "foo=" << foo << " bar="
  << b;

In which case I'd probably split the space away from bar, writing the second line as "<< "foo=" << foo << " "...

and there is inconsistency between

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << " is less than "
                      << " bar=" << b;
std::string reason = " is less than ";
LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                      << " foo=" << a << reason << " bar=" << b;

Maybe I'm used to it, but that seems like a natural difference rather than an annoying inconsistency to me.
(The latter is clearly less readable with or without formatting, precisely *because* it's pasting together variables that we have to understand the semantics of to get any understanding).

std::string reason = " is less than ";
LOG_IF(DFATAL, a < b) << "=" << " foo=" << a << reason << " bar=" << b;

because that actually looks a little better IMHO than

std::string reason = " is less than ";
LOG_IF(DFATAL, a < b) << "="
                      << " foo=" << a << reason << " bar=" << b;

This is clearly a question of taste :-)

An option seems attractive here. We haven't met the bar of multiple competing published style guides, but here the historical behaviour is idiosyncratic but long-standing and does some things people want...

I'd ask that we call the existing behavior something flexible like Heuristic or so... e.g. today we break after endl, and we should probably extend that to strings ending in "\n" since endl is often silly.
(If I was actually in a position to hack on clang-format more, I'd want to look at having a lower penalty when strings end in = etc, to encourage semantic breaks)

But whatever we do I think we should leave the current one as the default.

I think we'd like to keep it for at least Google style. Feel less strongly about the others, but if an option is added we might as well separate the implementation from the flag-flip.

I couldn't quite parse the last comment about \n, but yes the endl handling (didn't realize it also covered \n!) is implemented separately but falls under the same "heuristic formatting" umbrella I think.

LOG_IF(DFATAL, a < b) << "Equality condition can never be satisfied:"
                     << " foo=" << a << " and "
                     << "\n"
                     << " is less than "
                     << " bar=" << b;

This formatting seems fine to me: I'm not sure why you'd write " and " << "\n" separately if you didn't want a break there.

Another weak argument about why the current rule makes sense (break before << between two string literals) is: suppose you don't want such a break. Then you can change your code to concatenate the two string literals into a single one, avoiding the problem altogether, if possible (this can be unacceptable in cases where the combined length of the two string literals is excessive, but at that point we will likely break because of the column limit; or when there is a chain of more than 2 string literals and the user wishes to keep them fluently reflowing as the code evolves.)

However, a similar argument applies in reverse: suppose clang-format did not force-break before << between two string literals. Then you can force a line break by adding an empty line comment. That seems to me like an OK tradeoff between regularity, consistency with non-string literals cases, and flexibility to influence the formatting in cases that are better off with line breaks.

Maybe this at its core is more of a communication/documentation issue, as I can totally see how confusing this heuristic is for users of clang-format.

I'm very slightly against adding a style option for this -- it feels too niche (how will we name it?) and it will leave users with a choice between suboptimal (in my opinion) cases. However adding an option seems like the best tradeoff short of coming up with a good heuristic.

MyDeveloperDay abandoned this revision.Sep 27 2022, 4:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2022, 4:44 AM