This is an archive of the discontinued LLVM Phabricator instance.

Make wide multi-character character literals ill-formed
ClosedPublic

Authored by cor3ntin on Jul 17 2021, 2:37 AM.

Details

Summary

This implements P2362, which has not het been approved by the
C++ committee, but because wide-multi character literals are
implementation defined, clang might not have to wait for WG21.

While I think this change could be applied to all languages - notably
C, I do not know whether there would be breakage from it.

The other part of P2362, making non-representable character
literals ill-formed, is already implemented by clang

Diff Detail

Event Timeline

cor3ntin requested review of this revision.Jul 17 2021, 2:37 AM
cor3ntin created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2021, 2:37 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

I think that C and C++ should behave the same here; at least, I don't see any reason why they should have different capabilities.

The paper said that there is no expected code breakage from this change, but have you tried building a diverse corpus of code (like a distro's worth of packages) under this patch to see if anything actually breaks in practice? (I don't expect breakage that isn't identifying an actual issue in the code, but having some verification would be appreciated.) This would also help to identify whether the change is appropriate for C as well.

clang/test/CodeGen/string-literal-short-wstring.c
1–5

No need to tell the file to be in C mode; it is by virtue of the extension.

19

I'd feel most comfortable if the top of the file did:

typedef __WCHAR_TYPE__ wchar_t;

as that's the definition of wchar_t from the stddef.h provided by Clang.

I think that C and C++ should behave the same here; at least, I don't see any reason why they should have different capabilities.

I agree but as WG14 hasn't weighted in I didn't want to make that call.
What do you think?

The paper said that there is no expected code breakage from this change, but have you tried building a diverse corpus of code (like a distro's worth of packages) under this patch to see if anything actually breaks in practice? (I don't expect breakage that isn't identifying an actual issue in the code, but having some verification would be appreciated.) This would also help to identify whether the change is appropriate for C as well.

We have done regexes over various repositories (every vcpkg package) with no match. Not running a complete compiler

I think that C and C++ should behave the same here; at least, I don't see any reason why they should have different capabilities.

I agree but as WG14 hasn't weighted in I didn't want to make that call.
What do you think?

My reading of C2x is that this is implementation-defined there as well.

6.4.4.4p13:

A wide character constant prefixed by the letter L has type wchar_t, an integer type defined in the
<stddef.h> header; a wide character constant prefixed by the letter u or U has type char16_t or
char32_t, respectively, unsigned integer types defined in the <uchar.h> header. The value of a
wide character constant containing a single multibyte character that maps to a single member of the
extended execution character set is the wide character corresponding to that multibyte character,
as defined by the mbtowc, mbrtoc16, or mbrtoc32 function as appropriate for its type, with an
implementation-defined current locale. The value of a wide character constant containing more
than one multibyte character or a single multibyte character that maps to multiple members of
the extended execution character set, or containing a multibyte character or escape sequence not
represented in the extended execution character set, is implementation-defined.

Do you agree?

The paper said that there is no expected code breakage from this change, but have you tried building a diverse corpus of code (like a distro's worth of packages) under this patch to see if anything actually breaks in practice? (I don't expect breakage that isn't identifying an actual issue in the code, but having some verification would be appreciated.) This would also help to identify whether the change is appropriate for C as well.

We have done regexes over various repositories (every vcpkg package) with no match. Not running a complete compiler

Regexes are a good start but they miss the goofy (and sometimes awful) stuff that people do with token pasting, line continuations, and other random tricks. Would you be willing to try this as an experiment, or am I asking too much? :-) My thinking is that if we don't see any breakage from compiling a diverse corpus of code, we've done enough due diligence to suggest this is safe for both C and C++, but if we see some breakage, we can either identify that there's some valid use for this that we've not considered (less likely) and would be informative for both WG21 and WG14, or we can identify that we helped find bugs in real world code (more likely) which is also good feedback for the committees.

cor3ntin updated this revision to Diff 366265.Aug 13 2021, 6:29 AM

Improve tests

cor3ntin marked an inline comment as done.Aug 13 2021, 6:30 AM
cor3ntin added inline comments.
clang/test/CodeGen/string-literal-short-wstring.c
19

Thanks, I was looking for exactly that

cor3ntin marked an inline comment as done.Aug 13 2021, 6:38 AM

I think that C and C++ should behave the same here; at least, I don't see any reason why they should have different capabilities.

I agree but as WG14 hasn't weighted in I didn't want to make that call.
What do you think?

My reading of C2x is that this is implementation-defined there as well.

6.4.4.4p13:

A wide character constant prefixed by the letter L has type wchar_t, an integer type defined in the
<stddef.h> header; a wide character constant prefixed by the letter u or U has type char16_t or
char32_t, respectively, unsigned integer types defined in the <uchar.h> header. The value of a
wide character constant containing a single multibyte character that maps to a single member of the
extended execution character set is the wide character corresponding to that multibyte character,
as defined by the mbtowc, mbrtoc16, or mbrtoc32 function as appropriate for its type, with an
implementation-defined current locale. The value of a wide character constant containing more
than one multibyte character or a single multibyte character that maps to multiple members of
the extended execution character set, or containing a multibyte character or escape sequence not
represented in the extended execution character set, is implementation-defined.

Do you agree?

Yes, I agree.
I think clang could make it ill-formed if it wanted to!
If we want to do that we could probably remove some more code :)

The paper said that there is no expected code breakage from this change, but have you tried building a diverse corpus of code (like a distro's worth of packages) under this patch to see if anything actually breaks in practice? (I don't expect breakage that isn't identifying an actual issue in the code, but having some verification would be appreciated.) This would also help to identify whether the change is appropriate for C as well.

We have done regexes over various repositories (every vcpkg package) with no match. Not running a complete compiler

Regexes are a good start but they miss the goofy (and sometimes awful) stuff that people do with token pasting, line continuations, and other random tricks. Would you be willing to try this as an experiment, or am I asking too much? :-) My thinking is that if we don't see any breakage from compiling a diverse corpus of code, we've done enough due diligence to suggest this is safe for both C and C++, but if we see some breakage, we can either identify that there's some valid use for this that we've not considered (less likely) and would be informative for both WG21 and WG14, or we can identify that we helped find bugs in real world code (more likely) which is also good feedback for the committees.

Unless there is a script to do that easily, I'm not sure I'll be able to get to it any time soon.
But really, there is 0 use for these things! And you can't do much goofiness L ## 'ab' certainly - but that wouldn't be very useful either

I think that C and C++ should behave the same here; at least, I don't see any reason why they should have different capabilities.

I agree but as WG14 hasn't weighted in I didn't want to make that call.
What do you think?

My reading of C2x is that this is implementation-defined there as well.

6.4.4.4p13:

A wide character constant prefixed by the letter L has type wchar_t, an integer type defined in the
<stddef.h> header; a wide character constant prefixed by the letter u or U has type char16_t or
char32_t, respectively, unsigned integer types defined in the <uchar.h> header. The value of a
wide character constant containing a single multibyte character that maps to a single member of the
extended execution character set is the wide character corresponding to that multibyte character,
as defined by the mbtowc, mbrtoc16, or mbrtoc32 function as appropriate for its type, with an
implementation-defined current locale. The value of a wide character constant containing more
than one multibyte character or a single multibyte character that maps to multiple members of
the extended execution character set, or containing a multibyte character or escape sequence not
represented in the extended execution character set, is implementation-defined.

Do you agree?

Yes, I agree.
I think clang could make it ill-formed if it wanted to!
If we want to do that we could probably remove some more code :)

I don't see a reason why we'd want C and C++ to diverge here for our implementation, so I'd say let's do C as well. @rsmith -- do you have any concerns with that?

The paper said that there is no expected code breakage from this change, but have you tried building a diverse corpus of code (like a distro's worth of packages) under this patch to see if anything actually breaks in practice? (I don't expect breakage that isn't identifying an actual issue in the code, but having some verification would be appreciated.) This would also help to identify whether the change is appropriate for C as well.

We have done regexes over various repositories (every vcpkg package) with no match. Not running a complete compiler

Regexes are a good start but they miss the goofy (and sometimes awful) stuff that people do with token pasting, line continuations, and other random tricks. Would you be willing to try this as an experiment, or am I asking too much? :-) My thinking is that if we don't see any breakage from compiling a diverse corpus of code, we've done enough due diligence to suggest this is safe for both C and C++, but if we see some breakage, we can either identify that there's some valid use for this that we've not considered (less likely) and would be informative for both WG21 and WG14, or we can identify that we helped find bugs in real world code (more likely) which is also good feedback for the committees.

Unless there is a script to do that easily, I'm not sure I'll be able to get to it any time soon.
But really, there is 0 use for these things! And you can't do much goofiness L ## 'ab' certainly - but that wouldn't be very useful either

Okay, I'm probably making too big of an ask here. I can't think of reasonable code that would be broken by this change. If that turns out to be incorrect, we can address it when we get a real world use case.

cor3ntin updated this revision to Diff 366279.Aug 13 2021, 7:43 AM

Make wide multi char ill-formed regardless of language mode

I'm okay with the changes, but I think we should call this out in the release notes as there's technically the potential for this to be a breaking change (in both C and C++).

cor3ntin updated this revision to Diff 366317.Aug 13 2021, 11:20 AM

Describe this change in the changelog.

aaron.ballman accepted this revision.Aug 13 2021, 11:32 AM

LGTM, but going to wait a bit to land it in case @rsmith has opinions.

This revision is now accepted and ready to land.Aug 13 2021, 11:32 AM
cor3ntin retitled this revision from Make wide multi-character character literals ill-formed in C++ mode to Make wide multi-character character literals ill-formed.Aug 13 2021, 11:43 AM