This is an archive of the discontinued LLVM Phabricator instance.

[Clang] String Literal and Wide String Literal Encoding from the Preprocessor
ClosedPublic

Authored by ThePhD on Apr 12 2021, 3:03 PM.

Details

Summary

Short version

Please let us know the encoding that the compiler chooses for it's implementation-defined, non-Unicode string literals so we can support users properly in cross-platform code.

Prior Art

A similar feature has already been patch-reviewed and merged into GCC trunk (I implemented it), ready to go out the door with GCC 11. It is compiler-specific, and that is intentional. It solved a user's bug report there.

I also put in a Feature Request for MSVC. It is also recommended to be compiler-specific. They are currently suffering the very interesting consequences of not handling it sooner. Stdlib library developers working on MSVC have to come up with library-based workarounds to determine the charset format of their string literals and praying rather than having a compiler macro for it with their std::fmt implementation: https://github.com/microsoft/STL/pull/1824 | https://github.com/microsoft/STL/issues/1576 | https://developercommunity.visualstudio.com/content/idea/1160821/-compiler-feature-macro-for-narrow-literal-foo-enc.html

The C++ Standard's Committee Study Group 16 - Unicode approved a paper that is currently undergoing LEWG to determine the string literal and wide string literal encoding at both compile-time and runtime; this patch prepares for the compile-time portion of that detection, which Corentin Jabot already created a proof-of-concept of for Clang, GCC and MSVC: https://wg21.link/p1885

I missed the 12 release, so I hope this makes it for 13.

Long version

C and C++'s string literals for both "narrow"/"multibyte" string literals (e.g. "foo") and "wide" string literals (e.g. L"foo") have an associated encoding defined by the implementation. Recently, a review has kicked off for both adding new "execution encodings" (e.g., string literal encodings) to Clang's Preprocessor and, subsequently, C and C++ frontends.

I left a comment for it to be taken care of but I'm certain the comment was drowned out by other contributions in both the -fexec-charset addition and the "Add support for iconv encodings and other things" patch review at:

iconv Literal Converter: https://reviews.llvm.org/D88741
-fexec-charset Enabling Patch: https://reviews.llvm.org/D93031

Whether or not this gets updated in the related (but not required) patches, this is necessary to successfully inform the end user on a Clang machine what the wide string literal and the narrow string literal encoding is.

We use the size of the wide character type (wchar_t) to inform our decision, as Windows and other old-style 32-bit IBM machines use UTF-16, while most Linux distributions use UTF-32. (This is not the case for IBM and other machines of specific make in China, Japan, and Korea, but I suspect Clang has not been ported to work on such machines.)

Knowing the literal and execution encodings is also of great importance to the C++ Standard Committee in general, as they have work coming down the pipeline that has been generally approved by SG16 and favorably reviewed by LEWG that will make use of such functionality soon, as mentioned in the Prior Art section above: https://wg21.link/p1885

Please consider making everyone who cares about portable encoding's lives easier, and please consider making the work on -fexec-charset and -fwide-exec-charset smoother for the end-user by approving these changes so I can cycle back to those other commits with a serious commitment to making sure end-users can know and act on the encoding of (wide) string literals.

Diff Detail

Event Timeline

ThePhD requested review of this revision.Apr 12 2021, 3:03 PM
ThePhD created this revision.
ThePhD edited the summary of this revision. (Show Details)Apr 12 2021, 3:03 PM
ThePhD edited the summary of this revision. (Show Details)Apr 12 2021, 3:05 PM
ThePhD edited the summary of this revision. (Show Details)Apr 12 2021, 3:08 PM
ThePhD edited the summary of this revision. (Show Details)
ThePhD updated this revision to Diff 337008.Apr 12 2021, 6:00 PM

Fixes formatting derps in the pre-linter check.

rsmith added a subscriber: rsmith.Apr 12 2021, 6:59 PM

Exposing this information seems fine to me. I think it'd be more useful to expose it in a way the preprocessor can inspect, but it's hard to see how to do that without requiring us to invent identifier-shaped names for all encodings, and perhaps we shouldn't be in the business of doing that. If this string literal approach addresses the use cases you have, then I think that's OK.

clang/test/Preprocessor/init.c
122–123

Please add documentation to docs/LanguageExtensions.rst for these.

ThePhD updated this revision to Diff 337024.Apr 12 2021, 7:21 PM

The tests check for macros in strict byte-value "alphabetical" order. We also need documentation, as was suggested!!

ThePhD marked an inline comment as done.Apr 12 2021, 7:29 PM

Exposing this information seems fine to me. I think it'd be more useful to expose it in a way the preprocessor can inspect ...

The reason it's done like this is because the linked patches are going to use iconv, and iconv (while maintaining a fixed set of encodings it can support with canonical names that can be passed to its tools) can have encodings of arbitrary name added to it. If the eventual -f(wide-?)exec-charset options just pass-through to iconv, it's fair to say any translation we do will become out of date or unsuitable on a given platform. That's why I just want to expose the name! I did the same thing for GCC's functionality.

Microsoft is a bit luckier in that they have a closed set of encodings and a stable mapping between Name <-> Code Page Identifier, but they don't even cover the same list of encodings as iconv covers in its default distribution. Corentin Jabot's https://wg21.link/p1885 gives a canonical mapping, but again it's a closed-set of mappings and iconv is inherently extensible (at the "I rebuilt the library and installed it on my OS" level)!

The string literals work for people because, despite not being preprocessor-comparable, they can be manipulated at compile-time and switched on in the usual ways at constexpr time. See usages in:

https://github.com/soasis/text/blob/main/include/ztd/text/detail/encoding_name.hpp#L198
https://github.com/soasis/text/blob/main/include/ztd/text/literal.hpp#L54

, which can be used at compile-time like:

https://github.com/soasis/text/blob/main/tests/basic_compile_time/source/validate_code_points.cpp#L45

clang/test/Preprocessor/init.c
122–123

Done and dusted!

ThePhD updated this revision to Diff 337035.Apr 12 2021, 8:21 PM
ThePhD marked an inline comment as done.

Thank you for this patch, I think this is useful functionality!

The string literals work for people because, despite not being preprocessor-comparable, they can be manipulated at compile-time and switched on in the usual ways at constexpr time. See usages in:

https://github.com/soasis/text/blob/main/include/ztd/text/detail/encoding_name.hpp#L198
https://github.com/soasis/text/blob/main/include/ztd/text/literal.hpp#L54

, which can be used at compile-time like:

https://github.com/soasis/text/blob/main/tests/basic_compile_time/source/validate_code_points.cpp#L45

What about for folks using this from C where there isn't constexpr functionality to help them?

clang/docs/LanguageExtensions.rst
387

In both cases, defined to a string *in what encoding* (<rimshot.wav>)? More seriously, we should probably be clear that this expands to a narrow string literal rather than, say, a const char * or std::string.

388

This is typically -> This macro typically expands to

(same below).

clang/lib/Frontend/InitPreprocessor.cpp
781

macros -> Macros

782

NOTE -> FIXME

ThePhD updated this revision to Diff 337158.Apr 13 2021, 8:20 AM

Change NOTE: to FIXME:.

Update the text for the documentation to be more explicit.

ThePhD marked 4 inline comments as done.Apr 13 2021, 8:57 AM

...

What about for folks using this from C where there isn't constexpr functionality to help them?

The unfortunate bit is that C won't be able to guarantee compile-time culling of branches, even if all values present are constexpr. Implementations like Clang and GCC have significantly powerful enough optimizers that usage of strcmp and memcmp can be recognized and turned into builtins, before being const-folded down. This can provide dead code elimination. But, otherwise, C can't guarantee elision of the code branches without an integral constant identifier in a Macro. This is more or less a deficiency in how weak Constant Expressions are in C, leaving most people to rely on implementation-defined behavior for constant folding to do anything worthwhile with their toolset.

I think in the future, if we are really invested in this path, we should come up with a canonical mapping and a specific way of saying "if this is a recognized encoding name it has an Integer Constant Expression of value X as defined by table Y in the documentation". We could provide macros __clang_literal_encoding_id__ and __clang_wide_literal_encoding_id that has the X integer constant expression value. But I think that should be a follow-on patch that evaluates the totality of encodings, and also maybe contacts some IBM folks who did the -fexec-charset patches so they can also give over any additional encoding mappings they want.

Because Clang is open, anyone could add to it and that way people could have that kind of ability in C. iconv has a very full list of encodings, and you'd also need to define a resistant equality function similar to what's implemented in soasis/text (https://github.com/soasis/text/blob/main/include/ztd/text/detail/encoding_name.hpp#L120) or in P1885 (http://wg21.link/p1885) so you can compare names in a consistent manner across platforms. After that equality you then yield the integer value, and then you'd go from there. It doesn't have to be standard, just compiler-specific.

I'm not sure all of that belongs in this patch, though, and I think I'd wait for the other patches about iconv literal converters to drop before having the fullness of that conversation.

ThePhD updated this revision to Diff 337178.Apr 13 2021, 9:01 AM
ThePhD set the repository for this revision to rG LLVM Github Monorepo.

Oops, almost forgot the doc fixes for the *wide__* macro!

aaron.ballman accepted this revision.Apr 13 2021, 10:51 AM

...

What about for folks using this from C where there isn't constexpr functionality to help them?

...

I'm not sure all of that belongs in this patch, though, and I think I'd wait for the other patches about iconv literal converters to drop before having the fullness of that conversation.

Okay, I'm sold. Thank you for the detailed explanation! The changes LGTM. Do you have commit privileges or would you like me to commit on your behalf? (If you'd like me to commit, what email address and name would you like me to use for commit attribution?)

This revision is now accepted and ready to land.Apr 13 2021, 10:51 AM

Okay, I'm sold. Thank you for the detailed explanation! The changes LGTM. Do you have commit privileges or would you like me to commit on your behalf? (If you'd like me to commit, what email address and name would you like me to use for commit attribution?)

I don't have commit privileges! You should use ThePhD <phdofthehouse@gmail.com>.

Okay, I'm sold. Thank you for the detailed explanation! The changes LGTM. Do you have commit privileges or would you like me to commit on your behalf? (If you'd like me to commit, what email address and name would you like me to use for commit attribution?)

I don't have commit privileges! You should use ThePhD <phdofthehouse@gmail.com>.

Thank you for the contribution! I've commit it on your behalf in 701d70d4c25c4e02b303ba6dee1495708496f615.