This is an archive of the discontinued LLVM Phabricator instance.

Add char8_t support (C++20)
ClosedPublic

Authored by JDevlieghere on Aug 19 2019, 3:13 PM.

Details

Summary

This patch adds support for C++20 char8_t.

Original patch by James Blachly [1], modified by me.

[1] http://lists.llvm.org/pipermail/lldb-dev/2019-August/015393.html

Diff Detail

Repository
rL LLVM

Event Timeline

JDevlieghere created this revision.Aug 19 2019, 3:13 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 19 2019, 3:13 PM
Herald added a subscriber: teemperor. · View Herald Transcript
JDevlieghere edited the summary of this revision. (Show Details)Aug 19 2019, 3:13 PM
JDevlieghere marked an inline comment as done.Aug 19 2019, 3:18 PM
JDevlieghere added inline comments.
lldb/packages/Python/lldbsuite/test/lang/cpp/char8_t/TestCxxChar8_t.py
26 ↗(On Diff #216002)

This should be 7

clayborg requested changes to this revision.Aug 19 2019, 3:19 PM
clayborg added a subscriber: clayborg.
clayborg added inline comments.
lldb/include/lldb/lldb-enumerations.h
170 ↗(On Diff #216002)

add to the end, or all LLDBRPC.framework calls that use this enumeration will fail for you.

lldb/source/Symbol/ClangASTContext.cpp
1383–1384 ↗(On Diff #216002)

are "const(T)" and "immutable(T)" the actualy type names or are they layers on top of a base "wchar" type? These shouldn't be needed if so as the base "wchar" type should end up handling the base type correctly. Remove?

1388–1389 ↗(On Diff #216002)

ditto

1393–1394 ↗(On Diff #216002)

ditto

This revision now requires changes to proceed.Aug 19 2019, 3:19 PM
  • Feedback Greg.
  • Remove dlang types.
JDevlieghere edited the summary of this revision. (Show Details)Aug 19 2019, 3:46 PM

Thank you for creating a revision and reviewing this.

I made inline comments on the test harness and Dlang types / qualifiers.

With removal of the Dlang types, where is the appropriate place to put them?
It is not clear to me whether language plugins can replace the functionality in ClangASTContext::GetBuiltinTypeForDWARFEncodingAndBitSize.

Thanks again

lldb/packages/Python/lldbsuite/test/lang/cpp/char8_t/TestCxxChar8_t.py
26 ↗(On Diff #216002)

I believe clang7 requires -fchar8_t, whereas the test harness here passes -std=c++2a ; char8_t is enabled via -std=c++2a beginning in clang-8

"(11): Prior to Clang 8, this feature is not enabled by -std=c++2a, but can be enabled with -fchar8_t. "

https://clang.llvm.org/cxx_status.html#p0482

lldb/source/Symbol/ClangASTContext.cpp
1383–1384 ↗(On Diff #216002)

In LDC (the LLVM D compiler), application of type qualifier immutable or const to a type T defines a new type.
I am not a DWARF expert, but running a sample program through lldb seems to confirm this:

error: need to add support for DW_TAG_base_type 'char' encoded with DW_ATE = 0x10, bit_size = 8
error: need to add support for DW_TAG_base_type 'const(char)' encoded with DW_ATE = 0x10, bit_size = 8
error: need to add support for DW_TAG_base_type 'immutable(char)' encoded with DW_ATE = 0x10, bit_size = 8

Whereas -- interestingly -- the reference compiler DMD encodes them differently (as char, const char, and char; respectively as the DWARF spec I guess has no qualifier for immutable). The LDC behavior is IMO more true to language spec.

labath added a subscriber: labath.Aug 20 2019, 1:43 AM
labath added inline comments.
lldb/packages/Python/lldbsuite/test/lang/cpp/char8_t/Makefile
4–6 ↗(On Diff #216008)

Replace with CFLAGS_EXTRAS+=-std=c++2a

lldb/packages/Python/lldbsuite/test/lang/cpp/char8_t/TestCxxChar8_t.py
46 ↗(On Diff #216008)

This comment looks misplaced.

lldb/packages/Python/lldbsuite/test/lang/cpp/char8_t/main.cpp
1–7 ↗(On Diff #216008)

It doesn't look like you actually need a running process for this at all. I think you could just make c8 a global variable and inspect it from the executable image directly (target variable c8).

lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
857–862 ↗(On Diff #216008)

It looks like you're also adding formatters for char8_t * and char8_t[N]. I guess those should be tested too...

JDevlieghere marked an inline comment as done.
  • Address Pavel's comments.

Thank you for creating a revision and reviewing this.

I made inline comments on the test harness and Dlang types / qualifiers.

With removal of the Dlang types, where is the appropriate place to put them?
It is not clear to me whether language plugins can replace the functionality in ClangASTContext::GetBuiltinTypeForDWARFEncodingAndBitSize.

Thanks again

Thanks James. I removed the DLang part from this patch because it's orthogonal to the char8_t support. The changes might be fine but they really should have their own differential and test. I'll try to put up another patch later today if I find the time.

This looks good to me, but why are we using a nul character to test utf8 support? Shouldn't we insert some funnier characters too? I mean, one of the advantages of unicode is that it should not be affected by the system code pages and such, so hopefully this would not cause problems even on some more exotic setups. (And I am pretty sure I remember already seeing some chinese chars in some of our data formatter tests)

lldb/packages/Python/lldbsuite/test/lang/cpp/char8_t/Makefile
6 ↗(On Diff #216175)

I'm pretty sure this part isn't needed too, particularly as we now don't even run the clean actions.

  • Remove clean rule
  • Use more readable names

This looks good to me, but why are we using a nul character to test utf8 support? Shouldn't we insert some funnier characters too? I mean, one of the advantages of unicode is that it should not be affected by the system code pages and such, so hopefully this would not cause problems even on some more exotic setups. (And I am pretty sure I remember already seeing some chinese chars in some of our data formatter tests)

I only glanced at the proposal, but unless I misunderstand the type only fits UTF-8 characters representable in 1 byte, which are basically just ASCII.

This looks good to me, but why are we using a nul character to test utf8 support? Shouldn't we insert some funnier characters too? I mean, one of the advantages of unicode is that it should not be affected by the system code pages and such, so hopefully this would not cause problems even on some more exotic setups. (And I am pretty sure I remember already seeing some chinese chars in some of our data formatter tests)

I only glanced at the proposal, but unless I misunderstand the type only fits UTF-8 characters representable in 1 byte, which are basically just ASCII.

I have now too glanced at the proposal (just the cppreference page, really :) ). I think I understand where you got this impression from, but I don't think that is fully correct. It is true that a *single* char8_t variable can hold only 8 bit UTF8 code units (*not* characters), but that is not surprising since UTF8 is a variable length encoding, so you can't have a type that matches one character exactly. However, an *array* of char8_t is a completely different thing, and I am pretty sure that these are intended to hold utf8 strings containing any utf8 characters (otherwise, it wouldn't really deserve to call itself a utf8 type), and so we should print (and test) it as regular utf8.

However, this actually surfaces the question of how should we format single char8_t variables. It makes sense to display the character value if the value happens to be ASCII, but I guess we shouldn't print something like "invalid utf8 character" if it does contain one unit of the multibyte characters.

This looks good to me, but why are we using a nul character to test utf8 support? Shouldn't we insert some funnier characters too? I mean, one of the advantages of unicode is that it should not be affected by the system code pages and such, so hopefully this would not cause problems even on some more exotic setups. (And I am pretty sure I remember already seeing some chinese chars in some of our data formatter tests)

I only glanced at the proposal, but unless I misunderstand the type only fits UTF-8 characters representable in 1 byte, which are basically just ASCII.

I have now too glanced at the proposal (just the cppreference page, really :) ). I think I understand where you got this impression from, but I don't think that is fully correct. It is true that a *single* char8_t variable can hold only 8 bit UTF8 code units (*not* characters), but that is not surprising since UTF8 is a variable length encoding, so you can't have a type that matches one character exactly. However, an *array* of char8_t is a completely different thing, and I am pretty sure that these are intended to hold utf8 strings containing any utf8 characters (otherwise, it wouldn't really deserve to call itself a utf8 type), and so we should print (and test) it as regular utf8.

Sounds like I simply misunderstood your earlier comment. I thought you meant putting a full UTF-8 *character* in a `char8_t.

However, this actually surfaces the question of how should we format single char8_t variables. It makes sense to display the character value if the value happens to be ASCII, but I guess we shouldn't print something like "invalid utf8 character" if it does contain one unit of the multibyte characters.

What about the current implementation that prints both the hex and the ASCII value?

  • Use UTF8 string
shafik added a subscriber: shafik.Aug 21 2019, 9:21 AM

This looks good to me, but why are we using a nul character to test utf8 support? Shouldn't we insert some funnier characters too? I mean, one of the advantages of unicode is that it should not be affected by the system code pages and such, so hopefully this would not cause problems even on some more exotic setups. (And I am pretty sure I remember already seeing some chinese chars in some of our data formatter tests)

I only glanced at the proposal, but unless I misunderstand the type only fits UTF-8 characters representable in 1 byte, which are basically just ASCII.

I have now too glanced at the proposal (just the cppreference page, really :) ). I think I understand where you got this impression from, but I don't think that is fully correct. It is true that a *single* char8_t variable can hold only 8 bit UTF8 code units (*not* characters), but that is not surprising since UTF8 is a variable length encoding, so you can't have a type that matches one character exactly. However, an *array* of char8_t is a completely different thing, and I am pretty sure that these are intended to hold utf8 strings containing any utf8 characters (otherwise, it wouldn't really deserve to call itself a utf8 type), and so we should print (and test) it as regular utf8.

However, this actually surfaces the question of how should we format single char8_t variables. It makes sense to display the character value if the value happens to be ASCII, but I guess we shouldn't print something like "invalid utf8 character" if it does contain one unit of the multibyte characters.

You may find the the C++ Evolution Working Groups entry on [N4197 Adding u8 character literals, [tiny] Why no u8 character literals?](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4540.html#119) and the proposal that add char8_t helpful in understanding the rationale and the proposal for char8_t runs through a lot of examples.

I changed the test to use frame variable again. With target variable the UTF-8 formatting doesn't work. Given that this patch just copies the Char16 and Char32 implementation, I think that's something for a different patch. I'll file a PR when this gets in.

(lldb) target variable ab
(const char8_t *) ab = 0x0000000100000fa6
(lldb) target variable abc
(char8_t [9]) abc = {
  [0] = 0xe4 u8'
                  [1] = 0xbd u8''                                                                                                                                                                                                                                                                                                                     [2] = 0xa0 u8''                                                                                                                                                                                                                                                                                                                                     [3] = 0xe5 u8'
                  [4] = 0xa5 u8''
  [5] = 0xbd u8''
  [6] = 0x00 u8'\0'
  [7] = 0x00 u8'\0'
  [8] = 0x00 u8'\0'
}

Sounds like I simply misunderstood your earlier comment. I thought you meant putting a full UTF-8 *character* in a `char8_t.

Ah yes, I can see how that request could have been interpreted this way. I'm glad that we understand each other.

However, this actually surfaces the question of how should we format single char8_t variables. It makes sense to display the character value if the value happens to be ASCII, but I guess we shouldn't print something like "invalid utf8 character" if it does contain one unit of the multibyte characters.

What about the current implementation that prints both the hex and the ASCII value?

I think that's fine. lgtm.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 21 2019, 2:34 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2019, 2:34 PM

A couple inline comments. I think this is looking pretty good.

lldb/trunk/packages/Python/lldbsuite/test/lang/cpp/char8_t/main.cpp
1

Is this include necessary?

lldb/trunk/source/Symbol/ClangASTContext.cpp
1391

I think the current style is to omit the else keywords on these, since each if returns. That would at least be consistent with several cases above.