This is an archive of the discontinued LLVM Phabricator instance.

[Lex] Fix for char32_t literal truncation on 16 bit architectures
Needs RevisionPublic

Authored by SebastianPerta on Jun 8 2022, 4:41 PM.

Details

Summary

On 16 bit architectures char32_t literals are truncated, for example U'\U00064321' will be truncated to 0x4321.
The issue can be seen using the RL78 backend which I announced a while ago (https://lists.llvm.org/pipermail/llvm-dev/2020-April/140546.html) and I'm ready to upstream.
Upstream, the problem can be observed on MSP430, however this patch is not sufficient in case of MSP430 since Char32Type is left to the default type UnsignedInt which is 16 bit in case of MSP430 (set in TargetInfo.cpp). On RL78 I set it to UnsignedLong just like in case of AVR (see AVR.h).

Regarding testing, I found the problem using the following test from the GCC regression:
gcc/testsuite/g++.dg/ext/utf32-1.C
I'm happy to write a new test if I can get any pointers where and how to write it (the test fails at execution so not sure how to test it without executing it).

Diff Detail

Event Timeline

SebastianPerta created this revision.Jun 8 2022, 4:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 8 2022, 4:41 PM
Herald added a subscriber: dylanmckay. · View Herald Transcript
SebastianPerta requested review of this revision.Jun 8 2022, 4:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 8 2022, 4:41 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

This sounds correct to me, though I don't feel like an expert here.
Upstream in non-experimental IntWidth == Char32Width except on AVR, so it won't actually break anything.

Regarding testing, it seems like this should change the diagnostic emitted for long x = 'abc' from "Character constant too long for its type" to "Multi-character character constant" when the target is AVR?

aaron.ballman added a subscriber: tahonermann.

On 16 bit architectures char32_t literals are truncated, for example U'\U00064321' will be truncated to 0x4321.

Is that valid?

https://eel.is/c++draft/basic.types#basic.fundamental-9.sentence-2 and also C2x 7.29p2 (which has the same underlying type requirement) both say that the underlying type for char32_t is uint_least32_t. Adding @tahonermann to verify.

On 16 bit architectures char32_t literals are truncated, for example U'\U00064321' will be truncated to 0x4321.

Is that valid?

https://eel.is/c++draft/basic.types#basic.fundamental-9.sentence-2 and also C2x 7.29p2 (which has the same underlying type requirement) both say that the underlying type for char32_t is uint_least32_t. Adding @tahonermann to verify.

I think Sebastian is describing the current broken behavior that this patch fixes.
So if you're saying "that sounds wrong" then I think we're all in agreement :-)

(Sorry if I'm misreading either Sebastian's comment or yours).

On 16 bit architectures char32_t literals are truncated, for example U'\U00064321' will be truncated to 0x4321.

Is that valid?

https://eel.is/c++draft/basic.types#basic.fundamental-9.sentence-2 and also C2x 7.29p2 (which has the same underlying type requirement) both say that the underlying type for char32_t is uint_least32_t. Adding @tahonermann to verify.

I think Sebastian is describing the current broken behavior that this patch fixes.
So if you're saying "that sounds wrong" then I think we're all in agreement :-)

(Sorry if I'm misreading either Sebastian's comment or yours).

Oh! I hadn't read it that way initially! But yes, I'm saying "this behavior sounds wrong and should be corrected." I misread the code and missed that it was fixing in the correct direction. :-)

I think this should have a test, and we can probably use _Static_assert for it: https://godbolt.org/z/611bx47qc

Regarding testing, it seems like this should change the diagnostic emitted for long x = 'abc' from "Character constant too long for its type" to "Multi-character character constant" when the target is AVR?

Currently on RL78 it returns this warning without any extra changes:

warning: multi-character character constant [-Wmultichar]
long x = 'abc';
         ^

@sammccall this is what you expect, correct?
I haven't built LLVM for AVR yet, I will do it and get back to you.

I think this should have a test, and we can probably use _Static_assert for it: https://godbolt.org/z/611bx47qc

Thank you for the help @aaron.ballman , I will update the patch with a test case using _Static_assert.

tahonermann requested changes to this revision.Jun 9 2022, 2:39 PM

When submitting patches, please create the diff with context so that the code can be navigated in Phabricator. See https://llvm.org/docs/Phabricator.html#phabricator-request-review-web for details on how to do so.

clang/lib/Lex/LiteralSupport.cpp
1600

I don't think this is quite right. For the code that follows this change to work as intended and issue the "Character constant too long for its type" diagnostic, the width needs to match that of int. This is required for multicharacter literals (they have type int) so that an appropriate diagnostic is issued for 'xxxxx' for targets that have a 32-bit int (or for 'xxx' for targets that have a 16-bit int)`.

Additionally, the type of a character constant in C is int.

I think what is needed is something like:

unsigned BitWidth = getCharWidth(Kind, PP.getTargetInfo());
if (IsMultiChar || !PP.getLangOpts().CPlusPlus)
  BitWidth = PP.getTargetInfo().getIntWidth();
llvm::APInt LitVal(BitWidth, 0);
This revision now requires changes to proceed.Jun 9 2022, 2:39 PM
sammccall added inline comments.Jun 20 2022, 6:20 AM
clang/lib/Lex/LiteralSupport.cpp
1600

Thanks for pointing this out.

My reading of https://eel.is/c++draft/lex.ccon#2 is that a multi-char char literal with a L/u8/u/U prefix is not int but the respective character types, so the conditions here are even a little *more* complicated than you suggest :-(

tahonermann added inline comments.Jun 21 2022, 9:48 AM
clang/lib/Lex/LiteralSupport.cpp
1600

That is partially correct. Per (lex.ccon)p1 sentence 3, multicharacter literals are not allowed to have a u8, u, or U prefix. A multicharacter literal with a L prefix does have a wchar_t type, but Clang does not currently support L prefixed multicharacter literals (multicharacter literals are conditionally-supported per (lex.ccon)p1 sentence 4)

All that being said, I would not be surprised if some additional conditions are needed. I don't recall where Clang diagnoses the ill-formed and unsupported cases.

Additionally, the type of a character constant in C is int.

This means that char32_t c4 = U'\U00064321'; is invalid in C. I know that is clang more strict with the standard than GCC, however I would like to mention that in GCC the value is not truncated to 16 bit which is I found this problem originally. I suppose we want to stick with the standard in clang.

My reading of https://eel.is/c++draft/lex.ccon#2 is that a multi-char char literal with a L/u8/u/U prefix is not int but the respective character types

As explained by @tahonermann is just in case of C in case of C++ literals have their respective character types:
I checked char8_t, char16_t and char32_t with u8,u and U respectively and the following line of code by @tahonermann works in all 3 cases.
unsigned BitWidth = getCharWidth(Kind, PP.getTargetInfo());
Since Kind will be utf8_char_constant, utf16_char_constant and utf32_char_constant respectively.
And since L is not supported I think all cases are accounted for.
Or am I missing something?
In case not, should I continue to put another patch together with suggested changes?

Additionally, the type of a character constant in C is int.

This means that char32_t c4 = U'\U00064321'; is invalid in C

No. A character constant that does not have an encoding prefix has type int in C. Character constants that have an encoding prefix have the type indicated by the encoding prefix. U'\U00064321' has type char32_t in C (a typedef of uint_least32_t which is a typedef of some integer type).

When posting an updated patch, please make sure to add the -U99999 option to diff to ensure that surrounding code context is available for code reviewers.