
[clang] Define __STDC_ISO_10646__
Needs Review · Public

Authored by cor3ntin on Jul 22 2021, 11:05 AM.

Details

Summary

Both the C and C++ standards suggest that the macro

__STDC_ISO_10646__ should be defined when
wchar_t can represent all Unicode characters.

Because, pending D93031, clang only supports
UTF-16 or UTF-32 as the wide execution encoding,
wchar_t can represent all Unicode characters
whenever it is at least 21 bits wide.

But clang only seems to support 2- or 4-byte
wchar_t, so we only define __STDC_ISO_10646__
when wchar_t is at least 4 bytes.

The value is rather meaningless, as a new
version of Unicode is unlikely to affect
its codespace, or we would have bigger problems.
We set it to the publication date of Unicode 13.

Note that this define was already set up
in the CloudABI target.

That project seems unmaintained,
and for the sake of simplicity I elected to change
the value used there
(only the presence of the macro is likely to be checked).

When D93031 is approved, this code can be
extended to check a user-defined wide execution
encoding. We could also imagine a per-target (OS)
default wide execution encoding, e.g., for wide EBCDIC.

Diff Detail

Unit Tests: Failed

Time  Test
2,720 ms  x64 debian > libarcher.barrier::barrier.c
Script: -- : 'RUN: at line 14'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/barrier/barrier.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/barrier/Output/barrier.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/barrier/Output/barrier.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/barrier/Output/barrier.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/barrier/barrier.c
2,970 ms  x64 debian > libarcher.critical::critical.c
Script: -- : 'RUN: at line 15'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/critical.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/critical.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/critical.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/critical.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/critical.c
2,880 ms  x64 debian > libarcher.critical::lock-nested.c
Script: -- : 'RUN: at line 15'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/lock-nested.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/lock-nested.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/lock-nested.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/lock-nested.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/lock-nested.c
2,930 ms  x64 debian > libarcher.critical::lock.c
Script: -- : 'RUN: at line 15'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/lock.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/lock.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/lock.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/lock.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/lock.c
2,870 ms  x64 debian > libarcher.parallel::parallel-simple2.c
Script: -- : 'RUN: at line 15'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/parallel/parallel-simple2.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/parallel/Output/parallel-simple2.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/parallel/Output/parallel-simple2.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/parallel/Output/parallel-simple2.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/parallel/parallel-simple2.c
View Full Test Results (19 Failed)

Event Timeline

cor3ntin created this revision. Jul 22 2021, 11:05 AM
cor3ntin requested review of this revision. Jul 22 2021, 11:05 AM
Herald added a project: Restricted Project. · View Herald Transcript. Jul 22 2021, 11:05 AM
aaron.ballman added a subscriber: aaron.ballman.

Can you also add a test that explicitly uses -fshort-wchar to verify that we don't predefine the macro?

clang/lib/Frontend/InitPreprocessor.cpp
430–435

Are we going to have to remember to update this value each time Unicode updates?

436–438

Can you also add a test that explicitly uses -fshort-wchar to verify that we don't predefine the macro?

There is one already! (init.c L288)

clang/lib/Frontend/InitPreprocessor.cpp
430–435

No. But I don't think it matters, as long as it is defined. GCC hasn't updated the value since 2017.

cor3ntin updated this revision to Diff 360914. Jul 22 2021, 11:57 AM

Remove braces, improve comment (Thanks Aaron)

aaron.ballman accepted this revision. Jul 22 2021, 12:09 PM

Can you also add a test that explicitly uses -fshort-wchar to verify that we don't predefine the macro?

There is one already! (init.c L288)

Oh, thank you for pointing that out!

LGTM!

clang/lib/Frontend/InitPreprocessor.cpp
430–435

Okay, I don't mind updating the value when it becomes a problem that someone notices.

This revision is now accepted and ready to land. Jul 22 2021, 12:09 PM

I'm not sure we should be populating this.

The _value_ is determined by what libc supports, so it probably needs to be left up to libc to define it.

In glibc, this define is set by the file /usr/include/stdc-predef.h, which GCC implicitly includes into all TUs whether they include libc headers or not. (Clang almost got that feature too in https://reviews.llvm.org/D34158 but it was reverted due to some test failures)

I'm not sure we should be populating this.

The _value_ is determined by what libc supports, so it probably needs to be left up to libc to define it.

Why is the value determined by what libc supports? The definition from the standard is:

If this symbol is defined, then every character in the Unicode required set, when stored in an
object of type wchar_t, has the same value as the short identifier of that character.

That doesn't seem to imply anything about the library, just the size of wchar_t.

I'm not sure we should be populating this.

The _value_ is determined by what libc supports, so it probably needs to be left up to libc to define it.

Why is the value determined by what libc supports? The definition from the standard is:

If this symbol is defined, then every character in the Unicode required set, when stored in an
object of type wchar_t, has the same value as the short identifier of that character.

That doesn't seem to imply anything about the library, just the size of wchar_t.

That's my reading as well.
This is only about representability, not about what the locale features can handle (if anything).
In particular, setting it only on linux would be less than useful.
The primary use of this macro is to detect environments where wchar_t is not UTF-32, namely Windows and IBM platforms.
So this should be defined on all relevant platforms, not just the ones which use glibc.

Note that this patch came about because someone had trouble with that macro not being present on mac.

I'm not sure we should be populating this.

The _value_ is determined by what libc supports, so it probably needs to be left up to libc to define it.

Why is the value determined by what libc supports? The definition from the standard is:

If this symbol is defined, then every character in the Unicode required set, when stored in an
object of type wchar_t, has the same value as the short identifier of that character.

That doesn't seem to imply anything about the library, just the size of wchar_t.

Huh. So it doesn't! Not sure why I got that idea. Disregard my comment then -- I agree that just setting it to some arbitrary Unicode version ought to be OK, and there's probably no need to ever change it, since the set of possible Unicode values isn't ever going to change.

And if the C library also defines this macro in its headers, that's ok, since -Wmacro-redefinition is suppressed by system headers.

I'm not sure we should be populating this.

The _value_ is determined by what libc supports, so it probably needs to be left up to libc to define it.

Why is the value determined by what libc supports? The definition from the standard is:

If this symbol is defined, then every character in the Unicode required set, when stored in an
object of type wchar_t, has the same value as the short identifier of that character.

That doesn't seem to imply anything about the library, just the size of wchar_t.

Every character in the Unicode required set encoded in what way? To say that such a character is stored in an object of type wchar_t means that interpreting the wchar_t yields that stored character. Methods to determine the interpretation of the stored wchar_t value include locale-sensitive functions such as wcstombs (and thus is tied to libc).

I'm not sure we should be populating this.

The _value_ is determined by what libc supports, so it probably needs to be left up to libc to define it.

Why is the value determined by what libc supports? The definition from the standard is:

If this symbol is defined, then every character in the Unicode required set, when stored in an
object of type wchar_t, has the same value as the short identifier of that character.

That doesn't seem to imply anything about the library, just the size of wchar_t.

Every character in the Unicode required set encoded in what way? To say that such a character is stored in an object of type wchar_t means that interpreting the wchar_t yields that stored character. Methods to determine the interpretation of the stored wchar_t value include locale-sensitive functions such as wcstombs (and thus is tied to libc).

"has the same value as the short identifier of that character." implies UTF-32.
There is no mention of interpretation here, the *value* is the same. As in, when casting to an integer type you get the code point value.
*Storing* that value might involve either assigning from a wide-character literal or mbrtowc.
Both methods imply some transcoding, the latter of which could be affected by locale such that it would store a different character, but again, is it related to this wording?

Note that by virtue of being a macro this cannot possibly be affected by locale.

A few scenarios

  • The encoding of wide literals as determined by clang is not UTF-32: the macro should be defined by neither the compiler nor the library
  • The encoding of wide literals as determined by the compiler is UTF-32, and libc agrees: this works as intended
  • The encoding of wide literals as determined by the compiler is UTF-32, but libc disagrees: nothing good can come of that.

The compiler and the libc have to agree here.
The library cannot (should not) define this macro without knowing the wide literal encoding.

Note that both standards imply that these macros should be defined when relevant, independently of the environment, which includes hosted and non-Linux+glibc platforms. So relying on a specific glibc implementation seems dubious, especially as glibc will *always* define that macro.

Now, I agree that the compiler and the library should ideally expose the same *value* for this macro (although I struggle to find code that actually relies on the value)

When D34158, as mentioned by @jyknight, lands, the value will be set to that of the library version, thereby overriding the compiler default.
On other systems, the value will be set to the library version whenever the library is included.

When we add support for non-UTF wide execution encodings, we can use that information to not define this macro.

joerg added a subscriber: joerg. Jul 23 2021, 4:35 AM

This patch is certainly wrong for NetBSD as the wchar_t encoding is up to the specific locale charset and *not* UCS-2 or UCS-4 for certain legacy encodings like the various shift encodings in East Asia.

aaron.ballman requested changes to this revision. Jul 23 2021, 4:40 AM

Every character in the Unicode required set encoded in what way? To say that such a character is stored in an object of type wchar_t means that interpreting the wchar_t yields that stored character. Methods to determine the interpretation of the stored wchar_t value include locale-sensitive functions such as wcstombs (and thus is tied to libc).

"has the same value as the short identifier of that character." implies UTF-32.
There is no mention of interpretation here, the *value* is the same. As in, when casting to an integer type you get the code point value.

This is how I interpret the words from the standard as well. I think it's purely about the bit width of wchar_t and whether it's wide enough to hold all Unicode code points as of a particular Unicode standard release.

I tried to do some archeology to see how this predefined macro came into existence. It was added in C99, at a time before we seemed to be collecting editor's reports, and there are no obvious papers on the topic, so I don't know what proposal added the feature. The C99 rationale document does not mention the macro at all, but from my reading of the rationale, it seems possible that this macro is related to the introduction of UCNs and whether \Unnnnnnnn can be stored in a wchar_t.

One thing I did find when doing my research though was: https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html which says, in part,

The standard defines at least a macro __STDC_ISO_10646__ that is only defined on systems where the wchar_t type encodes ISO 10646 characters. If this symbol is not defined one should avoid making assumptions about the wide character representation.

This matches the interpretation that the libc encoding is salient

but... we still need to define what happens in freestanding environments where there is no libc with character conversion functions, so it also sounds like it's partially in the realm of the compiler.

*Storing* that value might involve either assigning from a wide-character literal or mbrtowc.
Both methods imply some transcoding, the latter of which could be affected by locale such that it would store a different character, but again, is it related to this wording?

Note that by virtue of being a macro this cannot possibly be affected by locale.

A few scenarios

  • The encoding of wide literals as determined by clang is not UTF-32: the macro should be defined by neither the compiler nor the library
  • The encoding of wide literals as determined by the compiler is UTF-32, and libc agrees: this works as intended
  • The encoding of wide literals as determined by the compiler is UTF-32, but libc disagrees: nothing good can come of that.

The compiler and the libc have to agree here.
The library cannot (should not) define this macro without knowing the wide literal encoding.

I agree that the compiler and libc need to agree on the encoding.

Note that both standards imply that these macros should be defined when relevant independently of the environment which includes hosted and non-Linux+glibc platforms. So relying on a specific glibc implementation
seems dubious. Especially as glibc will *always* define that macro

I think the point was more about "who is generally responsible for defining this macro, the compiler or the library" as opposed to it being a glibc thing specifically. I notice that musl also defines the macro (https://git.musl-libc.org/cgit/musl/tree/include/stdc-predef.h#n4).

Now, I agree that the compiler and the library should ideally expose the same *value* for this macro (although I struggle to find code that actually relies on the value)

When D34158, as mentioned by @jyknight, lands, the value will be set to that of the library version, thereby overriding the compiler default.
On other systems, the value will be set to the library version whenever the library is included.

I think that's the correct behavior. The compiler says "my wchar_t encodes ISO 10646" and the library has the chance to say "my wide char functions expect something else" if need be.

Given that there are two people who think this macro relates to the standard library, I'm going to mark this review as needing changes so we don't accidentally land it. I think we should ask for an interpretation on the WG14 reflectors and come back once we have more information.

This revision now requires changes to proceed. Jul 23 2021, 4:40 AM

This patch is certainly wrong for NetBSD as the wchar_t encoding is up to the specific locale charset and *not* UCS-2 or UCS-4 for certain legacy encodings like the various shift encodings in East Asia.

How does the value of a macro get impacted by a runtime locale?

This patch is certainly wrong for NetBSD as the wchar_t encoding is up to the specific locale charset and *not* UCS-2 or UCS-4 for certain legacy encodings like the various shift encodings in East Asia.

How does the value of a macro get impacted by a runtime locale?

More importantly, given that clang unconditionally encodes wide strings as UTF-32, how can, for example, towupper(L'α') produce a sensible result?

Even after the more recent discussion, I still think my initial message was incorrect, and that the compiler should be defining this macro itself, as proposed in this patch. Note that my confusion was not that the macro being defined or not was dependent on libc behavior, only the precise value it should be defined to.

Responding to a couple points:

I think the point was more about "who is generally responsible for defining this macro, the compiler or the library" as opposed to it being a glibc thing specifically. I notice that musl also defines the macro (https://git.musl-libc.org/cgit/musl/tree/include/stdc-predef.h#n4).

Exactly so. *IF* this macro relates to library behavior, then libraries should define it -- and not just glibc. Other systems could/should provide a stdc-predef.h file as well. (but per above, I don't think this is the case here.)

This patch is certainly wrong for NetBSD as the wchar_t encoding is up to the specific locale charset and *not* UCS-2 or UCS-4 for certain legacy encodings like the various shift encodings in East Asia.

Yet, the compiler currently always puts UTF-16/UTF-32 in wchar_t string literals. If that is inconsistent with the runtime, then the system as a whole currently has a serious bug. There is currently no platform for which Clang uses a non-UTF encoding for wchar_t. If there were such a platform, it would then be correct to not define this macro for it. There's no getting away from the compiler needing to be aware of the encoding of wchar_t, independent of this patch, so there's no point in punting the definition of the macro to the libc.

Now, maybe FreeBSD should be such a platform that uses a different wchar_t encoding...which leads to the question: what is the encoding Clang should be using here? What *should* L"\U00100000" emit? It sounds like wchar_t doesn't even have a consistent encoding at runtime, which implies that there's no way the compiler can create a correct wchar_t string literal. So maybe it should simply throw a compilation error if you try to use L"" or L''?

Per https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html this same bug also exists for Solaris.

This patch is certainly wrong for NetBSD as the wchar_t encoding is up to the specific locale charset and *not* UCS-2 or UCS-4 for certain legacy encodings like the various shift encodings in East Asia.

How does the value of a macro get impacted by a runtime locale?

NetBSD doesn't set the macro. And yes, this is one of the fundamental design issues of long char literals. Section 2 of the following now 20 year old Itojun paper goes into some of the problems with the assumption of a single universal character set:
https://www.usenix.org/legacy/publications/library/proceedings/usenix01/freenix01/full_papers/hagino/hagino.pdf
Even an encoding that embeds ISO 10646 fully and uses a flag bit to denote values (entirely valid, as Unicode is restricted to 21 bits) should not get this macro set.

I think this patch as it stands would cause problems with libc implementations that put their #define somewhere other than stdc-predef.h (e.g., older versions of glibc that put it in features.h, or other standard libraries that put it in yvals.h) due to the macro redefinition with a conflicting expansion. Such implementations are non-conforming but certainly exist. Likewise, it could break the build for the same reason if we start implicitly including stdc-predef.h at some point. So I think that blindly predefining this macro will prove problematic in at least some circumstances.

Perhaps we can treat __STDC_ISO_10646__ as a builtin macro, with the following properties:

  • -Wbuiltin-macro-redefined is suppressed for it, at least in system headers -- if libc redefines it, that's fine, we'll use the libc definition.
  • The builtin macro behaves as if it's not defined if wchar_t literals don't use UTF-32 encoding.
  • When using UTF-32 encoding, the value of the macro is determined by asking the target how its libc behaves, unless we're in freestanding mode.
  • For targets that use a stdc-predef.h, the target determines the value by preprocessing stdc-predef.h and inspecting what value it sets the macro to. Other targets can do whatever makes sense for them, such as sniffing the libc version and defining the macro based on that, or just hardcoding the right value.

(And we can do similar things for other __STDC_* macros that get defined by stdc-predef.h by some standard libraries.)

This has some nice implications: we don't ever look at stdc-predef.h unless a libc-dependent macro is actually inspected, don't have conflicts between our macro and one defined by libc headers, can still produce a proper environment in freestanding mode, can still define the macro when using a libc version that doesn't happen to define the macro itself at all, and so on. Also, if the macro value ever needs to take compiler support into account (for example, if a Python-style L"\N{SNOWMAN}" syntax is added and the macro is supposed to indicate which character names are available too), we can easily support that. But it would be a little complex, and the macro wouldn't appear in -dM -E unless we added some special-casing for it.

I think this patch as it stands would cause problems with libc implementations that put their #define somewhere other than stdc-predef.h (e.g., older versions of glibc that put it in features.h, or other standard libraries that put it in yvals.h) due to the macro redefinition with a conflicting expansion. Such implementations are non-conforming but certainly exist. Likewise, it could break the build for the same reason if we start implicitly including stdc-predef.h at some point. So I think that blindly predefining this macro will prove problematic in at least some circumstances.

I don't understand what problem you expect to see here. We already suppress -Wmacro-redefined [which this falls under] for system headers -- doesn't that suffice?

I think this patch as it stands would cause problems with libc implementations that put their #define somewhere other than stdc-predef.h (e.g., older versions of glibc that put it in features.h, or other standard libraries that put it in yvals.h) due to the macro redefinition with a conflicting expansion. Such implementations are non-conforming but certainly exist. Likewise, it could break the build for the same reason if we start implicitly including stdc-predef.h at some point. So I think that blindly predefining this macro will prove problematic in at least some circumstances.

I don't understand what problem you expect to see here. We already suppress -Wmacro-redefined [which this falls under] for system headers -- doesn't that suffice?

I'd like to hear more about this as well, FWIW. I think we get all the benefits you describe by defining the macro in the frontend and letting the macro suppression work if there's a conflicting definition from a system header.

Btw @cor3ntin, it looks like you've got some failing test cases that need to be looked at:

https://reviews.llvm.org/harbormaster/unit/view/898745/
https://reviews.llvm.org/harbormaster/unit/view/898763/

cor3ntin updated this revision to Diff 361639. Jul 26 2021, 6:33 AM

Fix tests.

Herald added a project: Restricted Project. · View Herald Transcript. Jul 26 2021, 6:33 AM

Perhaps a reasonable path forward here to address the BSD issue can be to add a targetinfo method:

/* Returns true if the expected encoding of wchar_t changes at runtime
   depending on locale for this target.
   Note that clang always encodes wide character literals as utf16/utf32,
   regardless. */
bool usesDynamicWCharEncoding();

which returns false by default, and true for FreeBSD, NetBSD, OpenBSD, and Solaris. Then, the condition for setting this define can be if (TI.getWCharWidth() >= 32 && !TI.usesDynamicWCharEncoding()).

That doesn't help the fact that wide char literals are effectively broken on those OSes, but oh well. Maybe someday they'll decide to switch to a consistent/documented wchar encoding, at which point clang can emit that (whatever it is). Or maybe someone will teach clang to emit an error or warning when using wide char literals on such targets. But I wouldn't hold my breath for either of those outcomes, and it seems fine to move forward here simply by exempting the known-to-be-problematic OSes.

I think this patch as it stands would cause problems with libc implementations that put their #define somewhere other than stdc-predef.h (e.g., older versions of glibc that put it in features.h, or other standard libraries that put it in yvals.h) due to the macro redefinition with a conflicting expansion. Such implementations are non-conforming but certainly exist. Likewise, it could break the build for the same reason if we start implicitly including stdc-predef.h at some point. So I think that blindly predefining this macro will prove problematic in at least some circumstances.

I don't understand what problem you expect to see here. We already suppress -Wmacro-redefined [which this falls under] for system headers -- doesn't that suffice?

I'd like to hear more about this as well, FWIW. I think we get all the benefits you describe by defining the macro in the frontend and letting the macro suppression work if there's a conflicting definition from a system header.

I'd expect we will break at least things like libc++ tests that build with -Wsystem-headers in the case where libc defines the macro. But if the problem is limited to old libc and a rare use case, and can easily be worked around with a -Wno flag, that's probably fine.

One benefit we don't get with this approach is providing the right value for the macro (without paying the cost of always including stdc-predef.h). AFAICS, the only possible use for the value of the macro is to detect libc support, so having Clang pick a specific value seems wrong to me. In some ways I'd be more comfortable with this patch if we defined the macro to 1 and documented that we think WG14 was wrong to ask for a version number.

Perhaps a reasonable path forward here to address the BSD issue can be to add a targetinfo method:

/* Returns true if the expected encoding of wchar_t changes at runtime
   depending on locale for this target.
   Note that clang always encodes wide character literals as utf16/utf32,
   regardless. */
bool usesDynamicWCharEncoding();

which returns false by default, and true for FreeBSD, NetBSD, OpenBSD, and Solaris. Then, the condition for setting this define can be if (TI.getWCharWidth() >= 32 && !TI.usesDynamicWCharEncoding()).

This would work if it's necessary...

That doesn't help the fact that wide char literals are effectively broken on those OSes, but oh well. Maybe someday they'll decide to switch to a consistent/documented wchar encoding, at which point clang can emit that (whatever it is). Or maybe someone will teach clang to emit an error or warning when using wide char literals on such targets. But I wouldn't hold my breath for either of those outcomes, and it seems fine to move forward here simply by exempting the known-to-be-problematic OSes.

I still don't fully understand the original comment from Joerg. The encoding of L'a' cannot change at runtime; it's a literal whose encoding is decided entirely at compile time. @joerg -- did you mean that Clang produces a different literal encoding depending on the environment the host compiler is running in?

I'd expect we will break at least things like libc++ tests that build with -Wsystem-headers in the case where libc defines the macro. But if the problem is limited to old libc and a rare use case, and can easily be worked around with a -Wno flag, that's probably fine.

Doesn't look to be in an old libc: https://sourceware.org/git/?p=glibc.git;a=blob;f=include/stdc-predef.h;h=e130c462a7b71f87956a6d4b2bd69a38c9d9cb32;hb=refs/heads/master#l58

One benefit we don't get with this approach is providing the right value for the macro (without paying the cost of always including stdc-predef.h). AFAICS, the only possible use for the value of the macro is to detect libc support, so having Clang pick a specific value seems wrong to me. In some ways I'd be more comfortable with this patch if we defined the macro to 1 and documented that we think WG14 was wrong to ask for a version number.

Yeah, I'm hoping to hear what WG14 has to say on this. My original thinking was that this macro is used to tell users and libc what version of Unicode wchar_t literal values are encoded in (if any), but seeing that both glibc and musl (https://git.musl-libc.org/cgit/musl/tree/include/stdc-predef.h#n4) define this macro themselves, I am less certain.

That doesn't help the fact that wide char literals are effectively broken on those OSes, but oh well. Maybe someday they'll decide to switch to a consistent/documented wchar encoding, at which point clang can emit that (whatever it is). Or maybe someone will teach clang to emit an error or warning when using wide char literals on such targets. But I wouldn't hold my breath for either of those outcomes, and it seems fine to move forward here simply by exempting the known-to-be-problematic OSes.

I still don't fully understand the original comment from Joerg. The encoding of L'a' cannot change at runtime; it's a literal whose encoding is decided entirely at compile time. @joerg -- did you mean that Clang produces a different literal encoding depending on the environment the host compiler is running in?

Exactly. Unfortunately, this is a problem people have a tendency to ignore.

Any string literals (narrow and wide) that cannot be interpreted the same way at compile time and runtime will lead to mojibake or bugs.
If we admit that the encoding can change between executions, or even during the same execution (gasp!), then clang should probably reject string literals that contain characters that are not represented identically in all the possible encodings that can be set at runtime.
(which would end up in the best case being a subset of the Basic Latin block, or if we admit JIS... probably the empty set!)

One benefit we don't get with this approach is providing the right value for the macro (without paying the cost of always including stdc-predef.h).

What do you mean by "right value", though? As Aaron pointed out, the value seems only dependent upon what characters can fit into a wchar_t, which is independent of what unicode version the libc supports. If ISO10646 defines a new character, you can store that into a wchar_t, and, say, decode/encode to UTF-8 without a libc update. So the exact value doesn't much matter for a 32-bit wchar_t, so long as ISO10646 doesn't expand the size of a character beyond 32 bits. (Which they won't -- it's stuck at 21 bits effectively permanently.)

AFAICS, the only possible use for the value of the macro is to detect libc support, so having Clang pick a specific value seems wrong to me. In some ways I'd be more comfortable with this patch if we defined the macro to 1 and documented that we think WG14 was wrong to ask for a version number.

At this point, there are 3 versions of ISO10646 that changed properties relevant to this:

  • the initial ISO/IEC 10646-1:1993 allowed characters potentially up through 0x7FFFFFFF, but only defined characters up through 0xFFFF;
  • ISO/IEC 10646-2:2001 first actually defined characters beyond 0xFFFF;
  • and ISO/IEC 10646:2012 and later versions cut the maximum character value down to 0x10FFFF.

So it's not true that the version number is without meaning -- only that it doesn't matter much anymore, because things have settled down. Quite possibly when they first defined this, they expected that toolchains with a 16-bit wchar_t might set the define, since the standard -- at that point -- didn't have characters beyond 0xffff. But if those characters were indeed defined in a yet-to-be-released standard, then you'd have a problem. (As is the case today.)

And also, I think it'd be valid to #define __STDC_ISO_10646__ 200009L for 16-bit wchar_t platforms. (Not sure if we should, but that would appear to be valid).

Yeah, I'm hoping to hear what WG14 has to say on this. My original thinking was that this macro is used to tell users and libc what version of Unicode wchar_t literal values are encoded in (if any), but seeing that both glibc and musl (https://git.musl-libc.org/cgit/musl/tree/include/stdc-predef.h#n4) define this macro themselves, I am less certain.

Musl has the define simply because GCC does not. That's not an independent confirmation of anything; it is simply following the status quo set by GCC's initial choice in 2000 not to define the macro itself.

joerg added a comment. Jul 26 2021, 4:56 PM

I still don't fully understand the original comment from Joerg. The encoding of L'a' cannot change at runtime; it's a literal whose encoding is decided entirely at compile time. @joerg -- did you mean that Clang produces a different literal encoding depending on the environment the host compiler is running in?

It should, which is exactly why I consider non-trivial wchar_t literals fundamentally flawed as a concept. AFAICT the only somewhat portable value is \0.

I'd expect we will break at least things like libc++ tests that build with -Wsystem-headers in the case where libc defines the macro. But if the problem is limited to old libc and a rare use case, and can easily be worked around with a -Wno flag, that's probably fine.

Doesn't look to be in an old libc: https://sourceware.org/git/?p=glibc.git;a=blob;f=include/stdc-predef.h;h=e130c462a7b71f87956a6d4b2bd69a38c9d9cb32;hb=refs/heads/master#l58

The "old libc" case is for old versions of glibc that put the macro in features.h instead of in stdc-predef.h. The macros in stdc-predef.h aren't a problem until / unless we start auto-including that header.

One benefit we don't get with this approach is providing the right value for the macro (without paying the cost of always including stdc-predef.h).

What do you mean by "right value", though? As Aaron pointed out, the value seems only dependent upon what characters can fit into a wchar_t, which is independent of what unicode version the libc supports.

I don't see how that follows from the definition in the C standard; it says "every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character". This doesn't say anything about character or string literals, and for example mbstowcs stores characters in objects of type wchar_t too (it "stores not more than n wide characters into the array pointed to by pwcs"), so unless mbstowcs does the right thing I don't see how we can claim support for a new Unicode standard version. As far as I can tell, this macro is documenting a property of the complete implementation (compiler plus standard library), and should be set to the minimum of the version supported by the compiler and the version supported by the stdlib. I think it's OK for the compiler to say it supports *any* version, though, because we don't expect future Unicode versions to require any changes on our part. But they may require standard library changes.

If Aaron's checked with WG14 and the intent is for this to only constrain how literals are represented, and not the complete implementation, then I'm entirely fine with us defining the macro ourselves. But that's not the interpretation that several other vendors have taken. If we're confident that the intent is just that this macro lists (effectively) the latest version of the Unicode standard that we've heard of, we should let the various libc vendors that currently define the macro know that they're doing it wrong and the definition belongs in the compiler.

BTW, looks like the standard wording came from:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_273.htm

which indeed seems to suggest that the intent was to:

  1. ensure that WCHAR_MAX is at least the maximum character actually defined so far by the standard (which in past versions was 0xffff, and in current versions is 0x10ffff).
  2. ensure that for each of those characters defined by the standard, that it has the same numeric value stored in a wchar_t as the value the standard specifies.

The "old libc" case is for old versions of glibc that put the macro in features.h instead of in stdc-predef.h. The macros in stdc-predef.h aren't a problem until / unless we start auto-including that header.

The features.h header in every version of glibc since stdc-predef.h was split off has had #include <stdc-predef.h> in it. If the redefinition is a problem, it's still a problem in current versions.

One benefit we don't get with this approach is providing the right value for the macro (without paying the cost of always including stdc-predef.h).

What do you mean by "right value", though? As Aaron pointed out, the value seems only dependent upon what characters can fit into a wchar_t, which is independent of what unicode version the libc supports.

I don't see how that follows from the definition in the C standard; it says "every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character". This doesn't say anything about character or string literals, and for example mbstowcs stores characters in objects of type wchar_t too (it "stores not more than n wide characters into the array pointed to by pwcs"), so unless mbstowcs does the right thing I don't see how we can claim support for a new Unicode standard version.
As far as I can tell, this macro is documenting a property of the complete implementation (compiler plus standard library), and should be set to the minimum of the version supported by the compiler and the version supported by the stdlib. I think it's OK for the compiler to say it supports *any* version, though, because we don't expect future Unicode versions to require any changes on our part. But they may require standard library changes.

But that's exactly it -- there are no library OR compiler changes required to remain conformant with this property when a new standard version is released. The range of values wchar_t needs to represent won't change. Even considering mbstowcs, there's no problem because it will already do the right thing, with zero changes, no matter how many new characters are defined within that valid range of 0x0-0x10ffff -- assuming that it does store unicode ordinal values into wchar_t in the first place. UTF-8/16/32 encoding and decoding are agnostic to which characters have been defined.

Of course, the library does need to make certain other changes corresponding to a new version, e.g. updating the tables for iswalpha to return true for newly defined alphabetical characters, but that functionality seems irrelevant to this define.

If Aaron's checked with WG14 and the intent is for this to only constrain how literals are represented, and not the complete implementation, then I'm entirely fine with us defining the macro ourselves. But that's not the interpretation that several other vendors have taken. If we're confident that the intent is just that this macro lists (effectively) the latest version of the Unicode standard that we've heard of, we should let the various libc vendors that currently define the macro know that they're doing it wrong and the definition belongs in the compiler.

It's surely intended to cover the complete system, since the standard doesn't consider "compiler" vs "libc" as separate things, they're both just components of the "implementation". But as per above comments, I don't think that changes the conclusion here.

If Aaron's checked with WG14 and the intent is for this to only constrain how literals are represented, and not the complete implementation, then I'm entirely fine with us defining the macro ourselves. But that's not the interpretation that several other vendors have taken. If we're confident that the intent is just that this macro lists (effectively) the latest version of the Unicode standard that we've heard of, we should let the various libc vendors that currently define the macro know that they're doing it wrong and the definition belongs in the compiler.

The discussion on the WG14 reflectors has wound down. Summarizing the key point of the discussion, Richard asked:

One specific example I'd like to be considered:
Suppose the C standard library implementation's mbstowcs converts a certain multi-byte character C to somewhere in the Unicode private use area, because Unicode version N doesn't have a corresponding character. Suppose further that the compiler is aware of Unicode version N+1, in which a character corresponding to C was added. Is an implementation formed by that combination of compiler and standard library, that defines __STDC_ISO_10646__ to N+1, conforming? Or is it non-conforming because it represents character C as something other than the corresponding short name from Unicode version N+1?

And David Keaton (long-time WG14 member and current convener) replied:

Yikes! It does indeed sound like the library would affect the value of __STDC_ISO_10646__ in that case. Thanks for clarifying the details.

There was no further discussion after that point, so I think the unofficial WG14 stance is that the compiler and the library need to collude on setting the value of that macro.

One specific example I'd like to be considered:
Suppose the C standard library implementation's mbstowcs converts a certain multi-byte character C to somewhere in the Unicode private use area, because Unicode version N doesn't have a corresponding character. Suppose further that the compiler is aware of Unicode version N+1, in which a character corresponding to C was added. Is an implementation formed by that combination of compiler and standard library, that defines __STDC_ISO_10646__ to N+1, conforming? Or is it non-conforming because it represents character C as something other than the corresponding short name from Unicode version N+1?

And David Keaton (long-time WG14 member and current convener) replied:

Yikes! It does indeed sound like the library would affect the value of __STDC_ISO_10646__ in that case. Thanks for clarifying the details.

There was no further discussion after that point, so I think the unofficial WG14 stance is that the compiler and the library need to collude on setting the value of that macro.

I don't think that scenario is valid. MBCS-to-unicode mappings are a part of the definition of the MBCS (sometimes officially, sometimes de-facto defined by major vendors), not in the definition of Unicode.

And in fact, we have a real-life example of this: the GB18030 encoding. That standard specifies mappings of 24 characters to private-use-area unicode codepoints in the most recent version, GB18030-2005. (Which is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all of those 24 characters mapped to PUAs in GB18030-2005 were actually assigned official unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps those to PUA code-points. The only way that can change is if GB18030 gets updated.

I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 codepoints to the newly-defined unicode characters, instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!

One specific example I'd like to be considered:
Suppose the C standard library implementation's mbstowcs converts a certain multi-byte character C to somewhere in the Unicode private use area, because Unicode version N doesn't have a corresponding character. Suppose further that the compiler is aware of Unicode version N+1, in which a character corresponding to C was added. Is an implementation formed by that combination of compiler and standard library, that defines __STDC_ISO_10646__ to N+1, conforming? Or is it non-conforming because it represents character C as something other than the corresponding short name from Unicode version N+1?

And David Keaton (long-time WG14 member and current convener) replied:

Yikes! It does indeed sound like the library would affect the value of __STDC_ISO_10646__ in that case. Thanks for clarifying the details.

There was no further discussion after that point, so I think the unofficial WG14 stance is that the compiler and the library need to collude on setting the value of that macro.

I don't think that scenario is valid. MBCS-to-unicode mappings are a part of the definition of the MBCS (sometimes officially, sometimes de-facto defined by major vendors), not in the definition of Unicode.

Isn't that scenario basically the one we're in today where the compiler is unaware of what mappings the library provides?

And in fact, we have a real-life example of this: the GB18030 encoding. That standard specifies mappings of 24 characters to private-use-area unicode codepoints in the most recent version, GB18030-2005. (Which is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all of those 24 characters mapped to PUAs in GB18030-2005 were actually assigned official unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps those to PUA code-points. The only way that can change is if GB18030 gets updated.

I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 codepoints to the newly-defined unicode characters, instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!

Does that imply that an implementation supporting that encoding can't define __STDC_ISO_10646__ because it doesn't meet the "has the same value as the short identifier" requirement?

@jyknight, are you on the WG14 reflectors btw? Would you like to carry on with this discussion over there (or would you like me to convey your viewpoints on your behalf)?

I don't think that scenario is valid. MBCS-to-unicode mappings are a part of the definition of the MBCS (sometimes officially, sometimes de-facto defined by major vendors), not in the definition of Unicode.

Isn't that scenario basically the one we're in today where the compiler is unaware of what mappings the library provides?

What I mean is: unicode does not define the mappings of a legacy MBCS byte sequence to a unicode character. It's simply out of scope. Only 3 encodings are defined by the Unicode standard (UTF-8, UTF-16, UTF-32). Mappings for other encodings are defined, instead, either by their own standard, or else simply chosen arbitrarily by a vendor.

And in fact, we have a real-life example of this: the GB18030 encoding. That standard specifies mappings of 24 characters to private-use-area unicode codepoints in the most recent version, GB18030-2005. (Which is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all of those 24 characters mapped to PUAs in GB18030-2005 were actually assigned official unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps those to PUA code-points. The only way that can change is if GB18030 gets updated.

I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 codepoints to the newly-defined unicode characters, instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!

Does that imply that an implementation supporting that encoding can't define __STDC_ISO_10646__ because it doesn't meet the "has the same value as the short identifier" requirement?

No. The fact that the GB18030 encoding has an unfortunate mapping of its bytes to unicode characters does not change anything about __STDC_ISO_10646__. It does not affect, "every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character" at all. All we're talking about here is differences of opinion between implementations as to which unicode character a given GB18030 byte sequence should be translated to -- not the way in which a unicode character is stored in a wchar_t.

@jyknight, are you on the WG14 reflectors btw? Would you like to carry on with this discussion over there (or would you like me to convey your viewpoints on your behalf)?

I'm not. I'd be happy to have you convey my viewpoints.

@jyknight, are you on the WG14 reflectors btw? Would you like to carry on with this discussion over there (or would you like me to convey your viewpoints on your behalf)?

I'm not. I'd be happy to have you convey my viewpoints.

Thanks! I've passed them along to WG14 and will report back what I hear.

ThePhD added a subscriber: ThePhD. Edited Sat, Sep 4, 10:54 AM

Hi, my name is JeanHeyd Meneide. I'm the Project Editor for C, but more importantly I'm the author of this paper: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm

This paper was accepted yesterday (September 3rd, 2021) into the C Standard, and (after I merge it and the ~25 or so other papers + Annex X I need to merge) will appear in the next Draft of the C Standard.

As the paper's introduction and motivation note, the interpretation above, which tied the encoding of wchar_t strings and char (MBS) strings to the locale-dependent behavior of runtime functions like mbstowcs and wcstombs, was not only a little bit silly, but also impossible to enforce properly on most systems without severe undue burden.

The wording of the paper explicitly removes the tie-in of the encoding of string literals and wide string literals to the library functions and instead makes them implementation-defined. This has no behavior change on any platform (it is, in a very strict sense, an expansion of the current definition and a standardization of existing practice amongst all implementations). What it does mean is that Clang and every other compiler - so long as they pick an ISO10646-code-point-capable encoding for their wchar_t literals - can define this preprocessor macro unconditionally. My understanding is that on most systems where things have not been patched / tweaked, this applies since Clang vastly prefers UTF-32 in most of its setups.

It is my strong recommendation this patch be accepted and made unconditional, both in anticipation of the upcoming standard and the widespread existing practice.

What it does mean is that Clang and every other compiler - so long as they pick an ISO10646-code-point-capable encoding for their wchar_t literals - can define this preprocessor macro unconditionally. My understanding is that on most systems where things have not been patched / tweaked, this applies since Clang vastly prefers UTF-32 in most of its setups.

It is my strong recommendation this patch be accepted and made unconditional, both in anticipation of the upcoming standard and the widespread existing practice.

The paper does not update 6.10.8.2 to add "when expressed as a wchar_t character constant" to the wording for the macro, which is the subject of this patch.