Details
- Reviewers
- PaulkaToast - sivachandra 
Diff Detail
Event Timeline
These need tests.
| libc/src/uchar/CMakeLists.txt | ||
|---|---|---|
| 7 ↗ | (On Diff #242504) | remove comments | 
| 36 ↗ | (On Diff #242504) | newline | 
| libc/src/uchar/c16rtomb.cpp | ||
| 15 ↗ | (On Diff #242504) | Too long I guess. clang-format | 
| 16 ↗ | (On Diff #242504) | Remove ULL. | 
| 18–21 ↗ | (On Diff #242504) | We don't do block comments. Use //. Anyway I don't think you need to document library functions like this. | 
| 29 ↗ | (On Diff #242504) | This is not how pointers work. I'm not sure what the point of this is. | 
| libc/src/uchar/c16rtomb.h | ||
| 16 ↗ | (On Diff #242504) | No indentation. These all need clang-format run on them. | 
| libc/src/uchar/c32rtomb.cpp | ||
| 19–20 ↗ | (On Diff #242504) | We don't align like this. clang-format will fix these. | 
| libc/src/uchar/mbrtoc16.cpp | ||
| 19–20 ↗ | (On Diff #242504) | I think case statements should be aligned with the switch. clang-format will tell us either way :) | 
| libc/src/uchar/mbrtoc32.cpp | ||
| 17 ↗ | (On Diff #242504) | What's the point of this? | 
Thanks a lot for the patch. I think @abrachet has already pointed out few problems. From my side, I am afraid I do not have time until the Friday to take a good look at it.
Clang-formatted, and replaced calloc with calls to new, as well as removed ULL from size_t variable initializations
also removed the block comment that accidentally slipped through.
As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?
Added trailing newline to CMakeLists
Clang-formatted, and replaced calloc with calls to new, as well as removed ULL from size_t variable initializations
also removed the block comment that accidentally slipped through
and fixed type with variable c16 being written as C16
For reference this is musl's implementation of c16rtomb http://git.musl-libc.org/cgit/musl/tree/src/multibyte/c16rtomb.c.
As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?
You could replicate the example that cppreference has. https://en.cppreference.com/w/c/string/multibyte/c16rtomb. Otherwise where have you seen these functions use in the wild? I have never seen it so I can't give any real world examples unfortunately.
libcxx has pretty sparse tests https://github.com/llvm/llvm-project/tree/master/libcxx/test/std/strings https://github.com/llvm/llvm-project/tree/master/libcxx/test/libcxx/strings I'm afraid.
You could look at the standard library of more modern languages. For example Rust made a big deal about Utf16 strings, you could hunt down those tests https://doc.rust-lang.org/std/primitive.str.html.
And in musl's tests (cloned from git://repo.or.cz/libc-test) they have libc-test/src/functional/clocale_mbfuncs.c, which is the closest I could find.
| libc/src/uchar/c16rtomb.cpp | ||
|---|---|---|
| 25 ↗ | (On Diff #242879) | For what its worth, this allocates one char and initializes it to StringSize. If you wanted to allocate StringSize chars you need new char[StringSize]. But new isn't safe to use for us, and I think it wouldn't even link because we have -nostdlib specified. void changeParam(int a) { a = 5; } int main() { int a = 0; changeParam(a); assert(!a); // passes, a not changed. } The same is true with pointers if I pass my char * to c16rtomb, if you reassign to it by malloc or new, it changes the local variable but not the callers variable. These functions (unless passed a nullptr) assume that s is already pointing to enough valid memory. | 
| libc/src/uchar/c16rtomb.h | ||
|---|---|---|
| 1–2 ↗ | (On Diff #242879) | clang-format did this because the line was too long, you should remove a few - to shorten it to I think 81 characters it might be 80. | 
Couple of high level comments:
- Looks like the public uchar.h header or a rule to generate it are missing. Prefer generation. See string.h for example.
- To define char16_t and char32_t types, use the predefined macro __UINT_LEAST16__TYPE__ and __UINT_LEAST32_TYPE__. It would have been ideal if the free standing stdint.h supports __need_* macros, but it does not.
Can you write unit tests using manually sythensized byte sequences.
| libc/src/uchar/c16rtomb.cpp | ||
|---|---|---|
| 16 ↗ | (On Diff #242879) | It would be good if you can add comments explaining the different cases that being handled here. | 
| 25 ↗ | (On Diff #242879) | If I am reading the standard correct, one should not need to allocate memory at all. It should be assumed that s is pointing to an appropriately sized byte array. So, one should not need new/malloc and any other allocator. I also think the same holds for the other functions as well. | 
Added uchar.h.def based on string.h.def, but I don't understand what it's actually doing.
Sorry, the documentation is not very clear at this moment. I will prepare a patch as an example for you.
Bump
| libc/src/uchar/c16rtomb.cpp | ||
|---|---|---|
| 16 ↗ | (On Diff #242879) | 0xD800 - 0xDFFF is a surrogate pair, the c16 parameter can not contain the 2 16 bit code units required to actually decode it to the proper UTF-32 codepoint. so should I try to make the decoder stateful in order to make that happen (not sure if that's even possible), or should I just replace it with the invalid replacement character? | 
| 25 ↗ | (On Diff #242879) | Ahh, I appreciate the tip on new, I'm new to C++. As for if the string is allocated I'm trying to look up some examples of how these APIs are called on github, it'll be a lot harder to verify if a string can hold a codepoint/codeunit if it's pre-allocated, because each index would look like a null terminater to begin with, right? | 
| libc/src/uchar/mbrtoc32.cpp | ||
| 17 ↗ | (On Diff #242504) | What's the point of what?  | 
Ah sorry, this completely slipped from my radar. I am currently sick, but will try get to this before the end of this week.
Fixed a minor clang-format line length issue in the source files, and removed a duplicated cmake entry
I'm refactoring this around wchar (and adding basic wchar support as well)
My plan is to wrap the wchar implementation around the uchar one, so this patch isn't ready yet.
Sorry it took me this long to comment here. Part of it was because I had to educate myself about mutli-byte characters and wide characters. Few high level questions:
- Are the functions char[16|32]rtomb doing a UTF-16|32 to UTF-8 conversion? Per the standard, they should convert to the current locale? May be UTF-8 is an acceptable target encoding. In which case, should we have an error reporting scheme when the locale is not set to UTF-8?
- You mention building multi-byte support over wide char support. However, if I am reading it right, it doesn't seem like it?
A generic comment: I think you are not using pointers correctly. For example, in the mbrtoc16 function, you have this:
if ((s & 0x80) == 0) {
    // ASCII
} else if ((s & 0x80) == ) 
...s is of type const char *restrict. Seems to me like that the intention here is to compare first the character *s?
In few other places, I see incomplete code. Is the patch ready for review?
Which standard are you referring to? I'm reading the C18 (N2176) standard just to be sure we're on the same page.
As for the c16rtomb and c32rtomb functions, they're defined in the uchar.h header, and the C18 standard says "7.28 Unicode utilities <uchar.h>", I take that to mean that these functions should convert between different representations of Unicode.
- You mention building multi-byte support over wide char support. However, if I am reading it right, it doesn't seem like it?
I not sure what you're referring to here? Do you mean where I said that I only intend to implement enough of the wchar header to support uchar, aka declare the mbstate_t type, and the mbstate_init function; and only becaose the standard declares this type and function in wchar, maybe it'd be better to put these parts in an internal uchar implementation file?
A generic comment: I think you are not using pointers correctly. For example, in the mbrtoc16 function, you have this:
if ((s & 0x80) == 0) { // ASCII } else if ((s & 0x80) == ) ...s is of type const char *restrict. Seems to me like that the intention here is to compare first the character *s?
Yes, you're right I originally wrote the code using different variable names using a very different implementation, and I messed up the syntax when copypasting the correct variable names and overlooked them when changing them.
In few other places, I see incomplete code. Is the patch ready for review?
Not quite, there's still a few things I'm working on, and I'm working on a few different patches at the same time as well.
Hey guys: @sivachandra.
My clang-format patch landed yesterday so I'm working again on my libc patch to add uchar and maybe wchar stuff.
I've been rewriting c16rtomb and it works a lot better now but I was wondering how we're supposed to handle errors in llvm-libc?
like, if there's a low surrogate without a high surrogate preceding it, that's an error.
a lone surrogate is an error as well,
etc.
please tell me we don't have to use errno
LLVM-libc should do what the standards say. For example, if the standards say that errno has to be set to a certain value to indicate error, then LLVM-libc should do that. Likewise, if standards say that the function in question should return an error value, then LLVM libc should do that.
I can probably give more specific answers if you can point out the particular function you are asking about.
Thanks,
Siva Chandra
Hey guys, I'm rebasing and starting work on this again, sorry for the wait I moved 2000 miles from Michigan to Oregon.
I have a question tho.
Uchar and wchar both rely on mbstate_t which is a global variable for their conversions.
I'm currently using an enum to hold the valid states different conversions can be in, and I'm wondering how it should all be layered together?
Should uchar depend on wchar, should wchar depend on uchar, should there be a private header that both of them use?
and also what about namespaces, should the enum be in a private namespace and the mbstate_t global be in a public namespace?
Any clarification would be great, thanks.
The standard says mbstate_t should be of a struct type: https://en.cppreference.com/w/c/string/multibyte/mbstate_t
But, I didn't find anything which says the state is a libc maintained global state. So, other than the fact that we need to define that struct in a common place like this, I do not see anything affecting the layering. Am I missing something?
Should uchar depend on wchar, should wchar depend on uchar, should there be a private header that both of them use?
and also what about namespaces, should the enum be in a private namespace and the mbstate_t global be in a public namespace?
Yes, you can choose to keep the "internals" of mbstate_t in an internal namespace.
Any clarification would be great, thanks.