Page MenuHomePhabricator

[Format] Add C++20 standard to style options
Needs ReviewPublic

Authored by modocache on Jul 20 2019, 5:35 PM.

Details

Summary

When C++ coroutines were adopted as part of the C++20 standard, a change
was committed in https://github.com/llvm/llvm-project/commit/10ab78e854f:
coroutine keywords such as co_yield went from being gated on the
presence of the -fcoroutines-ts flag, to instead being gated on
-std=c++2a. This resulted, perhaps unexpectedly to some users, in a
change in how coroutine keywords were formatted. Because libclangFormat
has only 3 options for formatting according to a language standard --
C++03, C++11, or "auto" -- and because it enabled C++20 keywords for all
settings aside from C++03, users who specified a standard of C++11 in
their style options would have their C++ formatted as if co_yield were
a keyword:

  • Before, C++03: co_yield ++i would be formatted as co_yield++ i
  • Before, C++11: co_yield ++i would be formatted as co_yield++ i
  • After, C++03: co_yield ++i would be formatted as co_yield++ i
  • After, C++11: co_yield ++i would be formatted as co_yield ++i

Although the "after" examples above appear like correct formatting
choices to those who are used to seeing coroutine keywords, I would
argue that they aren't technically correct, because a user may define a
variable in C++11 named co_yield, and they could increment that
variable by typing co_yield++. In this case, clang-format would change
the formatting, despite the user never opting-in to treating co_yield
as a keyword.

(There are other examples of clang-format suddenly formatting C++11 code
according to C++20 standards differently as a result of changes like
https://github.com/llvm/llvm-project/commit/10ab78e854f, and I've included
them as tests in this commit, but I won't go into detail explaining them
here.)

To give users the option of formatting according to the C++11 standard,
without the use of coroutines, I've added a style option for the C++20
standard (and, similarly to how the C++11 standard enables C++14, 17,
and 1z, I've written the documentation to indicate using the C++20
option enables any future C++2a standards).

In a future commit, I add a boolean style option to enable coroutines,
so that users may specify they wish to format according to the C++11
standard, but with coroutine keywords enabled.

Event Timeline

modocache created this revision.Jul 20 2019, 5:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 20 2019, 5:35 PM

Friendly ping! I'm wondering, from the clang-format maintainers' point of view, whether a diff like this and https://reviews.llvm.org/D65044 make sense to add? I do think that it is useful for users to specify whether they wish to use C++11 or C++20 constructs, but I'm not an expert.

So there's precedent for this in rL185149 (D1028). @klimek may have historical context. Overall, this makes sense.
Since the coroutines flag flip landed on the 9.x release branch, we may want this merged into the branch too.

The configuration (and particularly defaults) is difficult though. I think due to some excessive cleverness in the past, we're dug into a bit of a hole.

  • the default behavior needs to work well for projects using c++20. Given ambiguous input (e.g. use of co_yield, with LS_Auto), we should resolve in favor of this being C++20 rather than a funny-named identifier. 5 years from now, the former is going to be much more common than the latter, and there's no better time than now to switch the default. The most obvious implementation of this is having LS_Auto run the lexer in C++20 mode, and requiring users who don't want that to specify C++1x.
  • By bucketing C++11/14/17 together, the LS option has effectively evolved into a tristate: "old/new/detect". The result is we're not capturing user intent well: lots of .clang-format files specify Cpp11 to mean "new". If we're going to add a Cpp20 option, our only sane alternative is to disable C++20 features for such people. It's not actually a backwards-incompatible change at this point (assuming we get this on the 9.x branch), but it's an unfortunate situation and it's our fault. We need to improve the system so users have better options, because next time (when C++23 rolls around) things will be more complicated. Or if we need to change the behavior for C++17 vs 11...

I think the user intents we should capture here are:

  1. pinning to any specific version, that's what the codebase uses and we have a process for bumping it
  2. (maybe) floating with the latest language version, the codebase continuously adopts new constructs and fixes conflicts in old code
  3. I don't know/haven't bothered to configure

The reason I'm unsure about 2 is I don't know whether the people setting Cpp11 today (in projects using post-C++11 features) have a specific standard in mind. Maybe we should be conservative and leave it out for now.

This would lead to something like:

enum LanguageStandard {
  // Default: attempt to guess based on the file.
  // Lexer always runs in the latest standard.
  // Preserves backwards-compatible formatting (e.g. vector<vector<int> >) if the
  // input uses it consistently.
  LS_Auto,

  LS_Cpp03, // Parse and format in C++03-compatible mode
  LS_Cpp11, // No longer enables C++14 in the lexer
  LS_Cpp14,
  LS_Cpp17, // Equivalent to today's Cpp11!
  LS_Cpp2a, // Later to be renamed to Cpp23, but we'll keep accepting `Cpp2a` in the config parser forever
};

The obvious problem here is that this may well break existing projects that use Cpp11 and C++14/17 features.
I think we can punt on this by making Cpp11/14 still enable 14/17 in LangOpts for now (so we don't change behavior, just give people a better way to specify their intent) and change it later if needed.

What do you think?

Peanut gallery says: A priori, I don't see any reason for clang-format's LanguageStandard options to diverge from Clang's own -std= options. It sounds like currently they're very different, and you're proposing to make them basically the same. I think that's a good thing.

GCC and Clang still treat "lack of any -std= option" as a synonym for "-std=c++03". A priori this is an absolutely terrible default, but there would be some logic in making clang-format follow their lead. The only sensible alternative, IMHO, would be for you to treat "lack of any -std= option" as a synonym for MSVC's -std:c++latest, which means "-std=c++2a today and -std=c++2b tomorrow."

clang/docs/ClangFormatStyleOptions.rst
2237

C++2a will be C++20, barring any radically unforeseen events. So saying "C++20 and C++2a" is redundant. Personally I would follow GCC/Clang's lead and say "C++2a" until the standard is actually out.

clang/include/clang/Format/Format.h
1879

Again, C++2a is likely a synonym for C++20.
Three lines earlier, you might want to change "C++1z" to "C++17" (and grep the codebase for other instances of "1z").

clang/unittests/Format/FormatTest.cpp
3812

This doesn't seem to test what you set out to test, does it? (x) <= > 5 isn't a valid C++ expression anyway. Maybe what you want to test here is that clang-format is willing to reformat pre-C++2a code

LongName<&LongerName::operator<=> x;

into

LongName<
    &LongerName::operator<=
> x;

(that is, that it's willing to insert a line break between <= and >). However, if clang-format refused to insert a line break in that one position even in C++11 mode, would anything of value really be lost?

Peanut gallery says: A priori, I don't see any reason for clang-format's LanguageStandard options to diverge from Clang's own -std= options. It sounds like currently they're very different, and you're proposing to make them basically the same. I think that's a good thing.

+1. (It'd be nice to canonically spell these 'c++11' instead of 'Cpp11', but that's a different patch.)
Exception: 'Auto' is important for clang-format but doesn't/shouldn't exist in clang.

GCC and Clang still treat "lack of any -std= option" as a synonym for "-std=c++03".

Clang actually defaults to c++14 since clang 6. (CompilerInvocation.cpp near CLANG_DEFAULT_STD_CXX)

A priori this is an absolutely terrible default, but there would be some logic in making clang-format follow their lead. The only sensible alternative, IMHO, would be for you to treat "lack of any -std= option" as a synonym for MSVC's -std:c++latest, which means "-std=c++2a today and -std=c++2b tomorrow."

It's important that clang-format works sensibly with minimal configuration on as much code as possible.
This means both:

  • when running on pre-11 code that uses vector<vector<int> > consistently, it should preserve that c++03-compatible syntax
  • when running on post-11 code that uses vector<vector<int>> in some places, it should fix the formatting to use it everywhere

This behavior is an important default, and doesn't correspond to any one version of C++. So neither "latest" nor "c++14" is an acceptable default. (Though c++14 should certainly be an option, and latest should possibly be one too)

MyDeveloperDay added inline comments.Wed, Jul 24, 9:02 AM
clang/lib/Format/Format.cpp
80

Prior to C++20 being confirmed and given the line 2373 being LangOpts.CPlusPlus2a, it feels a little wrong to introduce anything here other than Cpp2a into what is effectively the .clang-format file format.

In the rare event that Cpp2a becomes something other than 20, this would leave us with .clang-format files out there defining something that doesn't really exist, but regardless of the assignment of Cpp2a to what it might become

IO.enumCase(Value, "Cpp2a", FormatStyle::LS_Cpp2a);

will always be correct, meaning Clang-Format files with Cpp2a would be correct for whatever it became Cpp20 or Cpp21....

When 2a becomes something real, there will likely be a change in clang to change LangOptions.. this feels like the correct time to add

LS_Cpp20 ~= LS_Cpp2a

and 

IO.enumCase(Value, "Cpp2a", FormatStyle::LS_Cpp20);
IO.enumCase(Value, "Cpp20", FormatStyle::LS_Cpp20);
IO.enumCase(Value, "C++20", FormatStyle::LS_Cpp20);

along with the corresponding changes to all uses of LS_Cpp2a upgraded to the correct version.

modocache marked 3 inline comments as done.Wed, Jul 24, 10:03 AM

It sounds like currently they're very different, and you're proposing to make them basically the same. I think that's a good thing.

I 100% agree with this statement, and thank you @sammccall for the suggestion to move in this direction.

The obvious problem here is that this may well break existing projects that use Cpp11 and C++14/17 features. I think we can punt on this by making Cpp11/14 still enable 14/17 in LangOpts for now (so we don't change behavior, just give people a better way to specify their intent) and change it later if needed.

I like this idea. LS_Cpp14 and LS_Cpp17 are new options that we'll introduce and so clang-format should interpret them very precisely, whereas LS_Cpp11 (for now) will remain a little ambiguous. Hopefully some thorough documentation will prevent some user confusion here.

The codebase I'm applying formatting to uses -std=c++17 and -fcoroutines-ts to compile, but specifies LS_Cpp11 in its .clang-format file -- as you mentioned, the intent was to simply use "new" C++ formatting, for whatever clang-format decided "new" to mean. I'd like to reason "out loud" about what changes we can make to clang-format, and what my codebase's migration path would look like should we make those changes. Based on the discussion above I'm thinking we make these commits/changes to clang-format itself:

  1. Add the enum LanguageStandard @sammccall suggested above verbatim, thereby allowing users to specify an arbitrary standard such as LS_Cpp17. To split up the work for easier code review, we might want to consider just updating the .clang-format parsing to recognize "Cpp17", but internally still have clang-format treat it as LS_Cpp11. This is because step (2) is a bit of a doozy.
  2. Audit the existing uses of FormatStyle::LS_Cpp{03,11,Auto} in the clang-format codebase and modify them to instead check for the appropriate standard. We might want this work to land as several small commits.
    1. For example, formatting that should only be applied to C++17 should be modified, and instead of checking for LS_Cpp11, it should check for LS_Cpp17 || LS_Cpp11 (LS_Cpp11 is still included in the check because we're using that to mean "new"). Commits with changes like this can land without any impact to existing users.
    2. Formatting that should only be applied to C++20 should be modified to check for LS_Cpp20 instead of LS_Cpp11. These commits will change end behavior for users. In my codebase, for example, we would probably want to update our .clang-format, from "Cpp11" to "Cpp20", before we update to a clang-format that includes these commits, because otherwise our co_yield keywords will begin to get formatted incorrectly.
  3. Change .clang-format parsing to recognize the new enum cases like "Cpp17" as LS_Cpp17, not LS_Cpp11 (see item 1).
  4. Commit the CoroutinesTS option from D65044. My codebase would switch from LS_Cpp20 to LS_Cpp17 and CoroutinesTS and almost no formatting should change.

If the above series of patches sounds reasonable, I wouldn't mind working on some or all of them, so let me know!

clang/docs/ClangFormatStyleOptions.rst
2237

Thank you, I'll do that. I wasn't sure about the naming conventions. While we're talking about this, mind if I ask: why did "C++1z" use "z" whereas "C++2a" use "a"? Once C++20 is actually out, is the next version "C++2b", or does "C++2a" start referring to C++23? Sorry for the rookie questions.

clang/include/clang/Format/Format.h
1879

Cool, will do, thank you!

clang/unittests/Format/FormatTest.cpp
3812

Absolutely right, I thought of this test not as a signal of what I want to happen, but rather to demonstrate what does happen. I'll either update this test to make that clear, or add a test like the one you suggested instead.

Clang actually defaults to c++14 since clang 6. (CompilerInvocation.cpp near CLANG_DEFAULT_STD_CXX)

Today I learned. Thanks! :)

It's important that clang-format works sensibly with minimal configuration on as much code as possible.
This means both:

  • when running on pre-11 code that uses vector<vector<int> > consistently, it should preserve that c++03-compatible syntax
  • when running on post-11 code that uses vector<vector<int>> in some places, it should fix the formatting to use it everywhere This behavior is an important default, and doesn't correspond to any one version of C++.

Ah, I see. That use-case makes sense to me. So, the observable behavior of what you describe, would that be the same behavior as "first scan the file and auto-detect its version, then format it exactly as if that version had been explicitly specified"? I don't see how to achieve the behavior you described unless you start by scanning the file for ad-hoc instances of "C++11isms" such as vector<vector<int>>. If you find at least one such instance, then you're in the "post-11" case; otherwise you're (maybe) in the "pre-11" case. But that scanning process has basically no overlap with the formatting process, right? You might want to add another scan for ad-hoc "C++2a-isms" such as co_yield -1; or maybe something due to operator<=> or maybe something due to P0634 "Down with typename" or maybe something due to constinit. (I say "maybe" because all those examples seem extremely contrived to me.)

IMHO, AFAIK, C++2a has sufficient backward-compatibility in the parser, so that there does not need to be any difference between how clang-format formats C++11 code and how clang-format formats C++2a code. It is true that token-soups such as f(co_yield - 1); have different parse trees in '11 and '2a, but I don't think any sane user should complain if clang-format formats that soup as f(co_yield -1) even in C++11 mode. But C++2a has a very large footprint, and I may well be failing to imagine something that a sane user would complain about.

The situation with co_yield seems analogous to how I imagine you must currently deal with fold-expressions in C++11/17 code. If you're in "Cpp11" mode, and you see (xs - ... - 1), you don't reformat it to (xs -... -1) merely because C++11 lacks fold-expressions, do you?

clang/docs/ClangFormatStyleOptions.rst
2237

The convention started with C++0x, the standard that followed C++03. (As 2009 passed, the joke became that "0x" must stand for something hexadecimal.) Eventually, C++0x turned into C++11. So then the next version was C++1y (since "y" comes after "x"), which became C++14. The version after that was C++1z, which became C++17. The version after C++17 is called "C++2a" because from the beginning it was targeting a 2020 release date (hence the "2"), and because "a" comes after "z" if you're used to modular arithmetic. As a special mnemonic bonus, "C++1z" happened to be the last C++ release of the 2010s, and "C++2a" will be the first C++ release of the 2020s.
If the convention continues — and I see no reason it shouldn't — then C++2b will be '23 and C++2c will be '26. (That is, if WG21 keeps to its grueling three-year release cycle. I hope and pray that they won't, but, we'll see.)

clang/lib/Format/Format.cpp
2384–2387

Incidentally, these four lines seem like a great place to use Style.Standard >= FormatStyle::LS_CppWhatever (with a cast if necessary), unless that's against some style rule.

clang/unittests/Format/FormatTest.cpp
13954

If you're going to test C++11's behavior here, please use co_yield - 1; or something else that might reasonably appear in a C++11 program. co_yield++ i; is not valid C++ (unless i is a macro, I guess).

modocache marked 2 inline comments as done.Fri, Jul 26, 7:44 AM
modocache added inline comments.
clang/lib/Format/Format.cpp
2384–2387

This has already been updated to do so in D65183.

clang/unittests/Format/FormatTest.cpp
13954

i would be an iterator in this case.

Quuxplusone added inline comments.Fri, Jul 26, 8:55 AM
clang/unittests/Format/FormatTest.cpp
13954

No matter what i is, co_yield++ i; is never a valid sequence of tokens in C++17 (unless i is a macro, I guess).

co_yield ++i; will become valid in C++2a, but if you see that sequence of tokens in any C++ program ever, you can just assume that it is C++2a. There is never any reason to think that it should be formatted as co_yield++ i;, because that would not be valid C++17.

modocache updated this revision to Diff 213360.Mon, Aug 5, 7:23 AM

Thanks for the reviews, @sammccall, @Quuxplusone, and @MyDeveloperDay. I added C++14 and C++17 options. In an earlier comment I mentioned splitting this work up into a series of commits, but it ended up being a smaller set of changes than I thought, so I'll just update this diff with all of the changes. What do you all think?

modocache marked 6 inline comments as done.Mon, Aug 5, 7:26 AM

Friendly ping! @sammccall is this what you had in mind?

FWIW, looks unobjectionable to me, except for some nitpicks on the test cases (which are easy to fix so I'm hoping you just will :)).

clang/unittests/Format/FormatTest.cpp
13789

My comment https://reviews.llvm.org/D65043#inline-582865 still stands. Please just write

verifyFormat("f(co_yield - 1);");
verifyFormat("f(co_yield -1);", Cpp2a);
13792

FWIW, []() { co_return; } is not valid C++2a either. You can't have a lambda that is a coroutine and also has a deduced return type. Make this

verifyFormat("co_await []() -> g { co_return; }();", Cpp2a);

if you want it to be plausible C++2a.

modocache updated this revision to Diff 214168.Thu, Aug 8, 9:01 AM
modocache marked 3 inline comments as done.

Thanks for the review, @Quuxplusone! I addressed your test comments, but the 'co_yield' test is something I think I'll need to deal with separately.

modocache added inline comments.Thu, Aug 8, 9:01 AM
clang/unittests/Format/FormatTest.cpp
13789

Unfortunately I think there's a separate issue here, that co_yield -1 is formatted as co_yield - 1 no matter which C++ standard is used. I'd like to address that in a separate commit, if that's alright. For now, I removed the test verifying the C++17 behavior, and instead only test the correct C++20 formatting of co_yield ++it.

Sorry for taking so long here, I've been swamped.

Unfortunately there's a couple of pain points still:

  • this change is (mostly?) about being able to turn off c++20 parsing, so you preserve old desirable formatting of c++17 code. But the tests don't contain any such code, and I haven't seen any in the review. What does a real example look like?
  • I misunderstood, while coroutines are newly added to c++20 mode, we've been shipping c++20 formatting like <=> since at least 2017. So the arguments for giving Cpp11 a silly back-compat meaning probably sholud cover c++20 features too.

At this point, would we be better off just:

  • adding 'Latest' -> LS_Latest
  • mapping 'Cpp11' -> LS_Latest with a backward-compatibility comment
  • adding 'c++11' -> LS_cxx11 with sensible semantics (name matches clang's LangStandards.def)
  • adding 'c++14' -> LS_cxx14 etc
  • keeping Auto as today
clang/include/clang/Format/Format.h
1875

If we're going to make "Cpp11" exclude Cpp2a, we should probably try to stop this problem occurring again, by including LS_Latest, I think.

But I'm not sure we should exclude 2a, see below...

clang/lib/Format/Format.cpp
2386

I think this should be LexingStd >= Cpp17 || LexingStd == LS_Cpp11. no need to have weird behavior for the new Cpp14 value.

Similarly the previous line is probably clearer (though equivalent) as Std >= 14 || Std == 11

clang/lib/Format/TokenAnnotator.cpp
2865

<=

2884

<=

clang/unittests/Format/FormatTest.cpp
3811

This test has been around for a couple of years, I guess we've been treating Cpp11 as triggering the 2a parser since then. This looks like a regression.

13789

So unfortunately as this stands, it doesn't seem to test that the 2a flag makes any difference. What exactly is the formatting of non-2a code you're trying to preserve? Can we test that?