On 16-bit architectures, char32_t literals are truncated: for example, U'\U00064321' is truncated to 0x4321.
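To make the arithmetic concrete, here is a minimal sketch of the truncation described above. The helper `truncateToWidth` is hypothetical and is not part of Clang; it only mimics what happens when a literal's value is stored at a narrower bit width than it needs.

```cpp
#include <cstdint>

// Hypothetical helper (not Clang code): keep only the low `bits` bits of
// `value`, as happens when a literal is evaluated at too narrow a width.
constexpr std::uint32_t truncateToWidth(std::uint32_t value, unsigned bits) {
  return bits >= 32 ? value : (value & ((std::uint32_t{1} << bits) - 1));
}

// At a 16-bit width the upper bits of U'\U00064321' are lost.
static_assert(truncateToWidth(0x00064321u, 16) == 0x4321u,
              "16-bit width drops the high bits");

// At a 32-bit width the code point survives intact.
static_assert(truncateToWidth(0x00064321u, 32) == 0x00064321u,
              "32-bit width preserves the value");
```

The two checks correspond to the observed (truncated) result and the expected result when the literal is evaluated at the full width of char32_t.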
The issue can be seen using the RL78 backend, which I announced a while ago (https://lists.llvm.org/pipermail/llvm-dev/2020-April/140546.html) and am ready to upstream.
Upstream, the problem can be observed on MSP430. However, this patch alone is not sufficient for MSP430, because Char32Type is left at its default, UnsignedInt, which is 16 bits wide on MSP430 (set in TargetInfo.cpp). On RL78 I set it to UnsignedLong, just as AVR does (see AVR.h).
Regarding testing, I found the problem using the following test from the GCC regression test suite:
gcc/testsuite/g++.dg/ext/utf32-1.C
I'm happy to write a new test if I can get some pointers on where and how to write it (the test fails at execution time, so I'm not sure how to test it without executing it).
I don't think this is quite right. For the code that follows this change to work as intended and issue the "Character constant too long for its type" diagnostic, the width needs to match that of int. This is required for multicharacter literals (they have type int), so that an appropriate diagnostic is issued for 'xxxxx' on targets that have a 32-bit int (or for 'xxx' on targets that have a 16-bit int).
Additionally, the type of a character constant in C is int.
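As a rough illustration of the int-width point above, the sketch below checks whether a multicharacter literal fits into int. The function `fitsInInt` and its parameters are hypothetical names chosen for this example, and the sketch assumes 8-bit characters. It is not the logic Clang actually uses.

```cpp
// Hypothetical helper (not Clang code): a multicharacter literal with
// `numChars` characters of `charWidth` bits each fits only if the combined
// bits do not exceed the width of int on the target.
constexpr bool fitsInInt(unsigned numChars, unsigned charWidth,
                         unsigned intWidth) {
  return numChars * charWidth <= intWidth;
}

// Target with a 32-bit int: 'xxxx' fits, 'xxxxx' should be diagnosed.
static_assert(fitsInInt(4, 8, 32), "four chars fit in a 32-bit int");
static_assert(!fitsInInt(5, 8, 32), "five chars exceed a 32-bit int");

// Target with a 16-bit int: 'xx' fits, 'xxx' should be diagnosed.
static_assert(fitsInInt(2, 8, 16), "two chars fit in a 16-bit int");
static_assert(!fitsInInt(3, 8, 16), "three chars exceed a 16-bit int");
```

If the check used a wider value than the target's int, a literal such as 'xxx' would pass unnoticed on a target with a 16-bit int.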
I think what is needed is something like: