This is an archive of the discontinued LLVM Phabricator instance.

Created uChar implementation for libc
Needs ReviewPublic

Authored by MarcusJohnson91 on Feb 4 2020, 7:11 PM.

Details

Diff Detail

Event Timeline

MarcusJohnson91 created this revision.Feb 4 2020, 7:11 PM
MarcusJohnson91 created this object with visibility "All Users".
abrachet edited reviewers, added: sivachandra; removed: libc-commits.Feb 4 2020, 7:26 PM
abrachet changed the visibility from "All Users" to "Public (No Login Required)".Feb 4 2020, 7:44 PM

These need tests.

libc/src/uchar/CMakeLists.txt
7 ↗(On Diff #242504)

remove comments

36 ↗(On Diff #242504)

newline

libc/src/uchar/c16rtomb.cpp
15 ↗(On Diff #242504)

Too long I guess. clang-format

16 ↗(On Diff #242504)

Remove ULL.

18–21 ↗(On Diff #242504)

We don't do block comments. Use //. Anyway I don't think you need to document library functions like this.

29 ↗(On Diff #242504)

This is not how pointers work. I'm not sure what the point of this is.

libc/src/uchar/c16rtomb.h
16 ↗(On Diff #242504)

No indentation. These all need clang-format run on them.

libc/src/uchar/c32rtomb.cpp
19–20 ↗(On Diff #242504)

We don't align like this. clang-format will fix these.

libc/src/uchar/mbrtoc16.cpp
19–20 ↗(On Diff #242504)

I think case statements should be aligned with the switch. clang-format will tell us either way :)

libc/src/uchar/mbrtoc32.cpp
17 ↗(On Diff #242504)

What's the point of this?

Thanks a lot for the patch. I think @abrachet has already pointed out few problems. From my side, I am afraid I do not have time until the Friday to take a good look at it.

MarcusJohnson91 updated this revision to Diff 242877.EditedFeb 6 2020, 5:37 AM
MarcusJohnson91 marked an inline comment as done.

Clang-formatted, and replaced calloc with calls to new, as well as removed ULL from size_t variable initializations

also removed the block comment that accidentally slipped through.

As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?

Fixed typo's.

Added trailing newline to CMakeLists

Clang-formatted, and replaced calloc with calls to new, as well as removed ULL from size_t variable initializations

also removed the block comment that accidentally slipped through

and fixed type with variable c16 being written as C16

For reference this is musl's implementation of c16rtomb http://git.musl-libc.org/cgit/musl/tree/src/multibyte/c16rtomb.c.

As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?

You could replicate the example that cppreference has. https://en.cppreference.com/w/c/string/multibyte/c16rtomb. Otherwise where have you seen these functions use in the wild? I have never seen it so I can't give any real world examples unfortunately.

libcxx has pretty sparse tests https://github.com/llvm/llvm-project/tree/master/libcxx/test/std/strings https://github.com/llvm/llvm-project/tree/master/libcxx/test/libcxx/strings I'm afraid.

You could look at the standard library of more modern languages. For example Rust made a big deal about Utf16 strings, you could hunt down those tests https://doc.rust-lang.org/std/primitive.str.html.

And in musl's tests (cloned from git://repo.or.cz/libc-test) they have libc-test/src/functional/clocale_mbfuncs.c, which is the closest I could find.

libc/src/uchar/c16rtomb.cpp
25 ↗(On Diff #242879)

For what its worth, this allocates one char and initializes it to StringSize. If you wanted to allocate StringSize chars you need new char[StringSize]. But new isn't safe to use for us, and I think it wouldn't even link because we have -nostdlib specified.

void changeParam(int a) {
    a = 5;
}

int main() {
   int a = 0;
   changeParam(a);
   assert(!a); // passes, a not changed.
}

The same is true with pointers if I pass my char * to c16rtomb, if you reassign to it by malloc or new, it changes the local variable but not the callers variable. These functions (unless passed a nullptr) assume that s is already pointing to enough valid memory.

abrachet added inline comments.Feb 6 2020, 11:07 AM
libc/src/uchar/c16rtomb.h
1–2 ↗(On Diff #242879)

clang-format did this because the line was too long, you should remove a few - to shorten it to I think 81 characters it might be 80.

Couple of high level comments:

  1. Looks like the public uchar.h header or a rule to generate it are missing. Prefer generation. See string.h for example.
  2. To define char16_t and char32_t types, use the predefined macro __UINT_LEAST16__TYPE__ and __UINT_LEAST32_TYPE__. It would have been ideal if the free standing stdint.h supports __need_* macros, but it does not.

As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?

Can you write unit tests using manually sythensized byte sequences.

libc/src/uchar/c16rtomb.cpp
16 ↗(On Diff #242879)

It would be good if you can add comments explaining the different cases that being handled here.

25 ↗(On Diff #242879)

If I am reading the standard correct, one should not need to allocate memory at all. It should be assumed that s is pointing to an appropriately sized byte array. So, one should not need new/malloc and any other allocator. I also think the same holds for the other functions as well.

MarcusJohnson91 marked an inline comment as done.

Fixed the header blocks and reformatted

Removed all of the new calls, starting work on the latest comments

MarcusJohnson91 marked an inline comment as done.

uint_leastX_t -> UINT_LEASTXTYPE__ in char16_t and char32_t typedefs

Added uchar.h.def based on string.h.def, but I don't understand what it's actually doing.

Added uchar.h.def based on string.h.def, but I don't understand what it's actually doing.

Sorry, the documentation is not very clear at this moment. I will prepare a patch as an example for you.

Bump

libc/src/uchar/c16rtomb.cpp
16 ↗(On Diff #242879)

0xD800 - 0xDFFF is a surrogate pair, the c16 parameter can not contain the 2 16 bit code units required to actually decode it to the proper UTF-32 codepoint.

so should I try to make the decoder stateful in order to make that happen (not sure if that's even possible), or should I just replace it with the invalid replacement character?

25 ↗(On Diff #242879)

Ahh, I appreciate the tip on new, I'm new to C++.

As for if the string is allocated I'm trying to look up some examples of how these APIs are called on github, it'll be a lot harder to verify if a string can hold a codepoint/codeunit if it's pre-allocated, because each index would look like a null terminater to begin with, right?

libc/src/uchar/mbrtoc32.cpp
17 ↗(On Diff #242504)

What's the point of what?
the char32_t variable being initialized as an array instead of a plain value?
The standard requires it, I agree it's dumb, but that's the API callers expect.

Ah sorry, this completely slipped from my radar. I am currently sick, but will try get to this before the end of this week.

Rebased and Squashed Uchar patch

Fixed a minor clang-format line length issue in the source files, and removed a duplicated cmake entry

I'm refactoring this around wchar (and adding basic wchar support as well)

My plan is to wrap the wchar implementation around the uchar one, so this patch isn't ready yet.

Rebased on master

Sorry it took me this long to comment here. Part of it was because I had to educate myself about mutli-byte characters and wide characters. Few high level questions:

  1. Are the functions char[16|32]rtomb doing a UTF-16|32 to UTF-8 conversion? Per the standard, they should convert to the current locale? May be UTF-8 is an acceptable target encoding. In which case, should we have an error reporting scheme when the locale is not set to UTF-8?
  2. You mention building multi-byte support over wide char support. However, if I am reading it right, it doesn't seem like it?

A generic comment: I think you are not using pointers correctly. For example, in the mbrtoc16 function, you have this:

if ((s & 0x80) == 0) {
    // ASCII
} else if ((s & 0x80) == ) 
...

s is of type const char *restrict. Seems to me like that the intention here is to compare first the character *s?

In few other places, I see incomplete code. Is the patch ready for review?

Are the functions char[16|32]rtomb doing a UTF-16|32 to UTF-8 conversion? Per the standard, they should convert to the current locale?

Which standard are you referring to? I'm reading the C18 (N2176) standard just to be sure we're on the same page.

As for the c16rtomb and c32rtomb functions, they're defined in the uchar.h header, and the C18 standard says "7.28 Unicode utilities <uchar.h>", I take that to mean that these functions should convert between different representations of Unicode.

  1. You mention building multi-byte support over wide char support. However, if I am reading it right, it doesn't seem like it?

I not sure what you're referring to here? Do you mean where I said that I only intend to implement enough of the wchar header to support uchar, aka declare the mbstate_t type, and the mbstate_init function; and only becaose the standard declares this type and function in wchar, maybe it'd be better to put these parts in an internal uchar implementation file?

A generic comment: I think you are not using pointers correctly. For example, in the mbrtoc16 function, you have this:

if ((s & 0x80) == 0) {
    // ASCII
} else if ((s & 0x80) == ) 
...

s is of type const char *restrict. Seems to me like that the intention here is to compare first the character *s?

Yes, you're right I originally wrote the code using different variable names using a very different implementation, and I messed up the syntax when copypasting the correct variable names and overlooked them when changing them.

In few other places, I see incomplete code. Is the patch ready for review?

Not quite, there's still a few things I'm working on, and I'm working on a few different patches at the same time as well.

MarcusJohnson91 added a comment.EditedMay 22 2020, 1:24 PM

Hey guys: @sivachandra.

My clang-format patch landed yesterday so I'm working again on my libc patch to add uchar and maybe wchar stuff.

I've been rewriting c16rtomb and it works a lot better now but I was wondering how we're supposed to handle errors in llvm-libc?

like, if there's a low surrogate without a high surrogate preceding it, that's an error.

a lone surrogate is an error as well,

etc.

please tell me we don't have to use errno

Hey guys: @sivachandra.

My clang-format patch landed yesterday so I'm working again on my libc patch to add uchar and maybe wchar stuff.

I've been rewriting c16rtomb and it works a lot better now but I was wondering how we're supposed to handle errors in llvm-libc?

LLVM-libc should do what the standards say. For example, if the standards say that errno has to be set to a certain value to indicate error, then LLVM-libc should do that. Likewise, if standards say that the function in question should return an error value, then LLVM libc should do that.

I can probably give more specific answers if you can point out the particular function you are asking about.

Thanks,
Siva Chandra

Hey guys, I'm rebasing and starting work on this again, sorry for the wait I moved 2000 miles from Michigan to Oregon.

I have a question tho.

Uchar and wchar both rely on mbstate_t which is a global variable for their conversions.

I'm currently using an enum to hold the valid states different conversions can be in, and I'm wondering how it should all be layered together?

Should uchar depend on wchar, should wchar depend on uchar, should there be a private header that both of them use?

and also what about namespaces, should the enum be in a private namespace and the mbstate_t global be in a public namespace?

Any clarification would be great, thanks.

Hey guys, I'm rebasing and starting work on this again, sorry for the wait I moved 2000 miles from Michigan to Oregon.

I have a question tho.

Uchar and wchar both rely on mbstate_t which is a global variable for their conversions.

I'm currently using an enum to hold the valid states different conversions can be in, and I'm wondering how it should all be layered together?

The standard says mbstate_t should be of a struct type: https://en.cppreference.com/w/c/string/multibyte/mbstate_t

But, I didn't find anything which says the state is a libc maintained global state. So, other than the fact that we need to define that struct in a common place like this, I do not see anything affecting the layering. Am I missing something?

Should uchar depend on wchar, should wchar depend on uchar, should there be a private header that both of them use?

and also what about namespaces, should the enum be in a private namespace and the mbstate_t global be in a public namespace?

Yes, you can choose to keep the "internals" of mbstate_t in an internal namespace.

Any clarification would be great, thanks.