This is an archive of the discontinued LLVM Phabricator instance.

Differential D74021

Created uChar implementation for libc
Needs Review · Public

Authored by MarcusJohnson91 on Feb 4 2020, 7:11 PM.

Details

Reviewers

PaulkaToast
sivachandra

Diff Detail

Event Timeline

MarcusJohnson91 created this revision. Feb 4 2020, 7:11 PM

MarcusJohnson91 created this object with visibility "All Users".

Herald added subscribers: MaskRay, mgorny. Feb 4 2020, 7:11 PM

abrachet edited reviewers, added: sivachandra; removed: libc-commits. Feb 4 2020, 7:26 PM

abrachet changed the visibility from "All Users" to "Public (No Login Required)". Feb 4 2020, 7:44 PM

These need tests.

libc/src/uchar/CMakeLists.txt
8	remove comments
36	newline
libc/src/uchar/c16rtomb.cpp
16	Too long I guess. clang-format
17	Remove ULL.
19–22	We don't do block comments. Use //. Anyway I don't think you need to document library functions like this.
30	This is not how pointers work. I'm not sure what the point of this is.
libc/src/uchar/c16rtomb.h
17	No indentation. These all need clang-format run on them.
libc/src/uchar/c32rtomb.cpp
20–21	We don't align like this. clang-format will fix these.
libc/src/uchar/mbrtoc16.cpp
20–21	I think case statements should be aligned with the switch. clang-format will tell us either way :)
libc/src/uchar/mbrtoc32.cpp
18	What's the point of this?

Thanks a lot for the patch. I think @abrachet has already pointed out a few problems. From my side, I am afraid I do not have time until Friday to take a good look at it.

Clang-formatted, and replaced calloc with calls to new, as well as removed ULL from size_t variable initializations

also removed the block comment that accidentally slipped through.

As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?

Fixed typos.

Added trailing newline to CMakeLists

Clang-formatted, and replaced calloc with calls to new, as well as removed ULL from size_t variable initializations

also removed the block comment that accidentally slipped through

and fixed typo with variable c16 being written as C16

For reference this is musl's implementation of c16rtomb http://git.musl-libc.org/cgit/musl/tree/src/multibyte/c16rtomb.c.

As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?

You could replicate the example that cppreference has: https://en.cppreference.com/w/c/string/multibyte/c16rtomb. Otherwise, where have you seen these functions used in the wild? I have never seen them used, so I can't give any real-world examples, unfortunately.

libcxx has pretty sparse tests https://github.com/llvm/llvm-project/tree/master/libcxx/test/std/strings https://github.com/llvm/llvm-project/tree/master/libcxx/test/libcxx/strings I'm afraid.

You could look at the standard library of more modern languages. For example Rust made a big deal about Utf16 strings, you could hunt down those tests https://doc.rust-lang.org/std/primitive.str.html.

And in musl's tests (cloned from git://repo.or.cz/libc-test) they have libc-test/src/functional/clocale_mbfuncs.c, which is the closest I could find.

libc/src/uchar/c16rtomb.cpp
25	For what it's worth, this allocates one char and initializes it to StringSize. If you wanted to allocate StringSize chars you need `new char[StringSize]`. But `new` isn't safe to use for us, and I think it wouldn't even link because we have `-nostdlib` specified.

    void changeParam(int a) { a = 5; }
    int main() {
      int a = 0;
      changeParam(a);
      assert(!a); // passes, a not changed.
    }

The same is true with pointers: if I pass my `char *` to c16rtomb and you reassign it with malloc or new, that changes the local variable but not the caller's variable. These functions (unless passed a nullptr) assume that `s` is already pointing to enough valid memory.
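To make the point about pointer parameters concrete, here is a minimal sketch. The helper names `reassign` and `write_through` are hypothetical and are not part of the patch; they only contrast reassigning a pointer parameter with writing through it into the caller's buffer.

```cpp
#include <cassert>

// Hypothetical helper: reassigns only its local copy of the pointer.
void reassign(char *s) {
  static char other[4] = {'x', 0, 0, 0};
  s = other; // changes the local copy; the caller's pointer is untouched
  (void)s;
}

// Hypothetical helper: writes through the pointer into the caller's buffer.
void write_through(char *s) {
  s[0] = 'y'; // visible to the caller, because the buffer itself is modified
}

void demo() {
  char buf[4] = {'a', 0, 0, 0}; // the caller owns and sizes the buffer
  char *p = buf;
  reassign(p);
  assert(p == buf && buf[0] == 'a'); // nothing changed for the caller
  write_through(p);
  assert(buf[0] == 'y'); // the write landed in the caller's memory
}
```

This is why a conversion function of this shape has to store its output with `s[0] = ...` or `*s = ...`, and why allocating inside the function cannot hand memory back through `s`.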

abrachet added inline comments. Feb 6 2020, 11:07 AM

libc/src/uchar/c16rtomb.h
1–2	clang-format did this because the line was too long. You should remove a few `-` to shorten it to, I think, 81 characters (it might be 80).

Couple of high level comments:

Looks like the public uchar.h header or a rule to generate it are missing. Prefer generation. See string.h for example.
To define char16_t and char32_t types, use the predefined macros __UINT_LEAST16_TYPE__ and __UINT_LEAST32_TYPE__. It would have been ideal if the freestanding stdint.h supported __need_* macros, but it does not.
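A small sketch of what such typedefs could look like, assuming a Clang or GCC toolchain (both predefine these macros, as well as `__CHAR_BIT__`). The names `u16_unit` and `u32_unit` are hypothetical: this example is written in C++, where char16_t and char32_t are keywords and cannot be declared with a typedef, whereas a C header would use the real names.

```cpp
// Assumption: Clang or GCC, which predefine the __UINT_LEASTn_TYPE__ macros.
// u16_unit and u32_unit are hypothetical stand-ins for char16_t and char32_t.
typedef __UINT_LEAST16_TYPE__ u16_unit;
typedef __UINT_LEAST32_TYPE__ u32_unit;

// The "least" types must be wide enough to hold one UTF-16 or UTF-32 code unit.
static_assert(sizeof(u16_unit) * __CHAR_BIT__ >= 16, "needs at least 16 bits");
static_assert(sizeof(u32_unit) * __CHAR_BIT__ >= 32, "needs at least 32 bits");
```

The benefit of this approach is that the header does not need to include stdint.h just to name the underlying types.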

In D74021#1861627, @MarcusJohnson91 wrote:

As for tests, I'm not sure where to begin with that, can anyone point me in the right direction?

Can you write unit tests using manually synthesized byte sequences?
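As an illustration of that suggestion, the sketch below feeds hand-written UTF-8 byte sequences to a decoder and checks the resulting code points. `decode_utf8` is a hypothetical stand-in written for this example, not the patch's mbrtoc32, and the checks use plain `assert` rather than the LLVM-libc unit test framework.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical decoder, not the patch's code: decodes one UTF-8 sequence from
// s (at most n bytes) into *out and returns the number of bytes consumed, or 0
// if the input is truncated or malformed. Overlong forms are not rejected.
std::size_t decode_utf8(const unsigned char *s, std::size_t n, char32_t *out) {
  if (n == 0)
    return 0;
  std::size_t len;
  char32_t cp;
  if (s[0] < 0x80) {
    len = 1;
    cp = s[0];
  } else if ((s[0] & 0xE0) == 0xC0) {
    len = 2;
    cp = s[0] & 0x1F;
  } else if ((s[0] & 0xF0) == 0xE0) {
    len = 3;
    cp = s[0] & 0x0F;
  } else if ((s[0] & 0xF8) == 0xF0) {
    len = 4;
    cp = s[0] & 0x07;
  } else {
    return 0; // continuation byte or invalid lead byte
  }
  if (n < len)
    return 0; // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    if ((s[i] & 0xC0) != 0x80)
      return 0; // not a continuation byte
    cp = (cp << 6) | (s[i] & 0x3F);
  }
  *out = cp;
  return len;
}

// Tests built from manually synthesized byte sequences.
void test_synthesized_sequences() {
  char32_t cp = 0;
  const unsigned char dollar[] = {0x24};                // U+0024
  const unsigned char cent[] = {0xC2, 0xA2};            // U+00A2
  const unsigned char euro[] = {0xE2, 0x82, 0xAC};      // U+20AC
  const unsigned char face[] = {0xF0, 0x9F, 0x98, 0x80}; // U+1F600
  assert(decode_utf8(dollar, 1, &cp) == 1 && cp == 0x24);
  assert(decode_utf8(cent, 2, &cp) == 2 && cp == 0xA2);
  assert(decode_utf8(euro, 3, &cp) == 3 && cp == 0x20AC);
  assert(decode_utf8(face, 4, &cp) == 4 && cp == 0x1F600);
  assert(decode_utf8(euro, 2, &cp) == 0); // truncated input
}
```

Each valid length (one to four bytes) gets one case, plus a truncated sequence as a first negative case.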

libc/src/uchar/c16rtomb.cpp
16	It would be good if you can add comments explaining the different cases that are being handled here.
25	If I am reading the standard correctly, one should not need to allocate memory at all. It should be assumed that `s` is pointing to an appropriately sized byte array. So, one should not need `new`/`malloc` or any other allocator. I also think the same holds for the other functions as well.
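A minimal sketch of an encoder that follows this contract: it never allocates and only writes into the buffer the caller passed in. `encode_utf8` is a hypothetical name used for illustration, and the sketch assumes the target multibyte encoding is UTF-8, which the review has not settled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical encoder: writes the UTF-8 form of cp into the caller's buffer s
// (which must have room for 4 bytes) and returns the number of bytes written.
// Returns 0 for values that are not Unicode scalar values.
std::size_t encode_utf8(char *s, char32_t cp) {
  if (cp <= 0x7F) {
    s[0] = static_cast<char>(cp); // 0xxxxxxx
    return 1;
  }
  if (cp <= 0x7FF) {
    s[0] = static_cast<char>(0xC0 | (cp >> 6)); // 110xxxxx
    s[1] = static_cast<char>(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return 0; // surrogates are not encodable on their own
    s[0] = static_cast<char>(0xE0 | (cp >> 12)); // 1110xxxx
    s[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s[2] = static_cast<char>(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    s[0] = static_cast<char>(0xF0 | (cp >> 18)); // 11110xxx
    s[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    s[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s[3] = static_cast<char>(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0; // beyond the Unicode range
}
```

A caller would typically declare `char buf[4];` (or `MB_LEN_MAX` bytes for the real functions) and pass `buf`; the return value tells it how many bytes are meaningful.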

Fixed the header blocks and reformatted

Removed all of the new calls, starting work on the latest comments

uint_leastX_t -> __UINT_LEASTX_TYPE__ in char16_t and char32_t typedefs

Added uchar.h.def based on string.h.def, but I don't understand what it's actually doing.

In D74021#1865554, @MarcusJohnson91 wrote:

Added uchar.h.def based on string.h.def, but I don't understand what it's actually doing.

Sorry, the documentation is not very clear at this moment. I will prepare a patch as an example for you.

Bump

libc/src/uchar/c16rtomb.cpp
16	0xD800 - 0xDFFF is the surrogate range; the c16 parameter cannot contain the two 16-bit code units required to actually decode a surrogate pair to the proper UTF-32 code point. So should I try to make the decoder stateful in order to make that happen (not sure if that's even possible), or should I just replace it with the invalid replacement character?
25	Ahh, I appreciate the tip on new, I'm new to C++. As for whether the string is allocated, I'm trying to look up some examples of how these APIs are called on GitHub. It'll be a lot harder to verify if a string can hold a code point/code unit if it's pre-allocated, because each index would look like a null terminator to begin with, right?
libc/src/uchar/mbrtoc32.cpp
18	What's the point of what? the char32_t variable being initialized as an array instead of a plain value? The standard requires it, I agree it's dumb, but that's the API callers expect.
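On the surrogate question above, a stateful conversion is possible: the high surrogate can be remembered in the conversion state until the low surrogate arrives. The sketch below is only an illustration under stated assumptions. `Utf16State` and `utf16_unit_to_utf8` are hypothetical names, the target encoding is assumed to be UTF-8, and returning 0 after storing a high surrogate is a choice made for this sketch rather than something taken from the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical conversion state: a stored high surrogate, or 0 when initial.
struct Utf16State {
  char16_t pending_high;
};

// Hypothetical c16rtomb-like sketch. Returns the number of UTF-8 bytes written
// to s, 0 when a high surrogate was only stored in *st, or (size_t)-1 when the
// code unit sequence is invalid. s must have room for 4 bytes.
std::size_t utf16_unit_to_utf8(char *s, char16_t c16, Utf16State *st) {
  const std::size_t error = static_cast<std::size_t>(-1);
  const bool is_high = c16 >= 0xD800 && c16 <= 0xDBFF;
  const bool is_low = c16 >= 0xDC00 && c16 <= 0xDFFF;
  char32_t cp = c16;
  if (st->pending_high != 0) {
    const char32_t high = st->pending_high;
    st->pending_high = 0;
    if (!is_low)
      return error; // high surrogate not followed by a low surrogate
    cp = 0x10000 + ((high - 0xD800) << 10) + (c16 - 0xDC00);
  } else if (is_high) {
    st->pending_high = c16; // remember it and wait for the low surrogate
    return 0;
  } else if (is_low) {
    return error; // low surrogate without a preceding high surrogate
  }
  if (cp <= 0x7F) {
    s[0] = static_cast<char>(cp);
    return 1;
  }
  if (cp <= 0x7FF) {
    s[0] = static_cast<char>(0xC0 | (cp >> 6));
    s[1] = static_cast<char>(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {
    s[0] = static_cast<char>(0xE0 | (cp >> 12));
    s[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s[2] = static_cast<char>(0x80 | (cp & 0x3F));
    return 3;
  }
  s[0] = static_cast<char>(0xF0 | (cp >> 18));
  s[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
  s[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
  s[3] = static_cast<char>(0x80 | (cp & 0x3F));
  return 4;
}
```

For example, U+1F600 is the pair 0xD83D 0xDE00 in UTF-16: the first call stores 0xD83D and writes nothing, and the second call combines both units and writes the four bytes F0 9F 98 80.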

Ah sorry, this completely slipped from my radar. I am currently sick, but will try to get to this before the end of this week.

Rebased and Squashed Uchar patch

Fixed a minor clang-format line length issue in the source files, and removed a duplicated cmake entry

Dushistov added a subscriber: Dushistov. Mar 12 2020, 4:35 PM

I'm refactoring this around wchar (and adding basic wchar support as well)

My plan is to wrap the wchar implementation around the uchar one, so this patch isn't ready yet.

Rebased on master

Sorry it took me this long to comment here. Part of it was because I had to educate myself about multi-byte characters and wide characters. A few high-level questions:

Are the functions char[16|32]rtomb doing a UTF-16|32 to UTF-8 conversion? Per the standard, they should convert to the current locale? Maybe UTF-8 is an acceptable target encoding. In which case, should we have an error reporting scheme when the locale is not set to UTF-8?
You mention building multi-byte support over wide char support. However, if I am reading it right, it doesn't seem like it?

A generic comment: I think you are not using pointers correctly. For example, in the mbrtoc16 function, you have this:

if ((s & 0x80) == 0) {
    // ASCII
} else if ((s & 0x80) == ) 
...

s is of type const char *restrict. Seems to me like that the intention here is to compare first the character *s?
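A short sketch of the dereferenced form of that check. `utf8_sequence_length` is a hypothetical helper, not a function from the patch; it reads the first byte through the pointer and classifies it by its leading bits.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the length of the UTF-8 sequence implied by the
// first byte at s, or 0 if that byte cannot start a sequence.
std::size_t utf8_sequence_length(const char *s) {
  // Dereference first, then test bits. Going through unsigned char avoids
  // sign extension on platforms where plain char is signed.
  const unsigned char lead = static_cast<unsigned char>(*s);
  if ((lead & 0x80) == 0x00)
    return 1; // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0)
    return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0)
    return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0)
    return 4; // 11110xxx
  return 0;   // 10xxxxxx continuation byte, or an invalid lead byte
}
```

Writing `s & 0x80` applies the mask to the address itself, which does not compile for a pointer operand; `*s & 0x80` tests the byte the pointer refers to.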

In few other places, I see incomplete code. Is the patch ready for review?

In D74021#1936199, @sivachandra wrote:

Are the functions char[16|32]rtomb doing a UTF-16|32 to UTF-8 conversion? Per the standard, they should convert to the current locale?

Which standard are you referring to? I'm reading the C18 (N2176) standard just to be sure we're on the same page.

As for the c16rtomb and c32rtomb functions, they're defined in the uchar.h header, and the C18 standard says "7.28 Unicode utilities <uchar.h>", I take that to mean that these functions should convert between different representations of Unicode.

You mention building multi-byte support over wide char support. However, if I am reading it right, it doesn't seem like it?

I'm not sure what you're referring to here. Do you mean where I said that I only intend to implement enough of the wchar header to support uchar, i.e. declare the mbstate_t type and the mbstate_init function, and only because the standard declares this type and function in wchar? Maybe it'd be better to put these parts in an internal uchar implementation file?

A generic comment: I think you are not using pointers correctly. For example, in the mbrtoc16 function, you have this:
if ((s & 0x80) == 0) {
    // ASCII
} else if ((s & 0x80) == ) 
...
s is of type const char *restrict. Seems to me like that the intention here is to compare first the character *s?

Yes, you're right. I originally wrote the code with different variable names and a very different implementation, and I messed up the syntax when copy-pasting the correct variable names in, then overlooked the mistakes.

In few other places, I see incomplete code. Is the patch ready for review?

Not quite, there are still a few things I'm working on, and I'm working on a few different patches at the same time as well.

Hey guys: @sivachandra.

My clang-format patch landed yesterday so I'm working again on my libc patch to add uchar and maybe wchar stuff.

I've been rewriting c16rtomb and it works a lot better now but I was wondering how we're supposed to handle errors in llvm-libc?

like, if there's a low surrogate without a high surrogate preceding it, that's an error.

a lone surrogate is an error as well,

etc.

please tell me we don't have to use errno

In D74021#2051565, @MarcusJohnson91 wrote:

Hey guys: @sivachandra.

My clang-format patch landed yesterday so I'm working again on my libc patch to add uchar and maybe wchar stuff.

I've been rewriting c16rtomb and it works a lot better now but I was wondering how we're supposed to handle errors in llvm-libc?

LLVM-libc should do what the standards say. For example, if the standards say that errno has to be set to a certain value to indicate error, then LLVM-libc should do that. Likewise, if standards say that the function in question should return an error value, then LLVM libc should do that.

I can probably give more specific answers if you can point out the particular function you are asking about.

Thanks,
Siva Chandra
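For the functions in this patch, the C standard does specify the error protocol: when c16rtomb or c32rtomb is given an invalid character, it stores EILSEQ in errno and returns (size_t)(-1). The sketch below shows only that validation step. `utf8_length_or_error` is a hypothetical helper, and treating surrogates and values above 0x10FFFF as the invalid inputs assumes a UTF-8 target.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical validation step for a c32rtomb-like function. Returns the
// number of UTF-8 bytes needed for c32, or (size_t)-1 with errno set to
// EILSEQ when c32 is not a Unicode scalar value.
std::size_t utf8_length_or_error(char32_t c32) {
  const bool is_surrogate = c32 >= 0xD800 && c32 <= 0xDFFF;
  if (is_surrogate || c32 > 0x10FFFF) {
    errno = EILSEQ; // error code required by the standard for this case
    return static_cast<std::size_t>(-1);
  }
  if (c32 <= 0x7F)
    return 1;
  if (c32 <= 0x7FF)
    return 2;
  if (c32 <= 0xFFFF)
    return 3;
  return 4;
}
```

Callers compare the result against `(size_t)-1` and then inspect errno, which is the same pattern used with the existing mbrtowc and wcrtomb family.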

Hey guys, I'm rebasing and starting work on this again, sorry for the wait I moved 2000 miles from Michigan to Oregon.

I have a question tho.

Uchar and wchar both rely on mbstate_t which is a global variable for their conversions.

I'm currently using an enum to hold the valid states different conversions can be in, and I'm wondering how it should all be layered together?

Should uchar depend on wchar, should wchar depend on uchar, should there be a private header that both of them use?

and also what about namespaces, should the enum be in a private namespace and the mbstate_t global be in a public namespace?

Any clarification would be great, thanks.

In D74021#2204954, @MarcusJohnson91 wrote:

Hey guys, I'm rebasing and starting work on this again, sorry for the wait I moved 2000 miles from Michigan to Oregon.

I have a question tho.

Uchar and wchar both rely on mbstate_t which is a global variable for their conversions.

I'm currently using an enum to hold the valid states different conversions can be in, and I'm wondering how it should all be layered together?

The standard says mbstate_t should be of a struct type: https://en.cppreference.com/w/c/string/multibyte/mbstate_t

But, I didn't find anything which says the state is a libc maintained global state. So, other than the fact that we need to define that struct in a common place like this, I do not see anything affecting the layering. Am I missing something?

Should uchar depend on wchar, should wchar depend on uchar, should there be a private header that both of them use?

and also what about namespaces, should the enum be in a private namespace and the mbstate_t global be in a public namespace?

Yes, you can choose to keep the "internals" of mbstate_t in an internal namespace.

Any clarification would be great, thanks.
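One possible shape for this layering, sketched under assumptions: the state lives in an object supplied by the caller, and its fields are declared in an internal namespace shared by the uchar and wchar code. The names `internal::MBState`, `example_mbstate_t`, and `example_mbsinit` are hypothetical and do not come from the patch or from LLVM-libc.

```cpp
#include <cassert>

namespace internal {
// Hypothetical layout of the conversion state shared by uchar and wchar code.
struct MBState {
  unsigned char bytes_pending; // continuation bytes still expected
  char32_t partial;            // code point bits accumulated so far
};
} // namespace internal

// Hypothetical public name; a real header would expose this as mbstate_t.
typedef internal::MBState example_mbstate_t;

// mbsinit-like check: nonzero if ps is null or describes the initial state.
int example_mbsinit(const example_mbstate_t *ps) {
  if (ps == nullptr)
    return 1;
  return (ps->bytes_pending == 0 && ps->partial == 0) ? 1 : 0;
}
```

A zero-initialized object (`example_mbstate_t st = {};`) is then the initial conversion state, and each conversion function updates the object its caller passed in rather than a global.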

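Revision Contents

Path	Size
libc/include/uchar.h.def	38 lines
libc/include/wchar.h.def	38 lines
libc/src/CMakeLists.txt	3 lines
libc/src/uchar/CMakeLists.txt	35 lines
libc/src/uchar/c16rtomb.h	20 lines
libc/src/uchar/c16rtomb.cpp	72 lines
libc/src/uchar/c32rtomb.h	20 lines
libc/src/uchar/c32rtomb.cpp	40 lines
libc/src/uchar/mbrtoc16.h	20 lines
libc/src/uchar/mbrtoc16.cpp	70 lines
libc/src/uchar/mbrtoc32.h	21 lines
libc/src/uchar/mbrtoc32.cpp	43 lines
libc/src/wchar/mbsinit.cpp	19 lines
libc/test/src/CMakeLists.txt	1 line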

Diff 251845

libc/include/uchar.h.def

This file was added.

				//===---------------- C standard library header uchar.h ------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_UCHAR_H
				#define LLVM_LIBC_UCHAR_H

				#include <__llvm-libc-common.h>

				#if !defined(_SIZE_T)
				#define _SIZE_T
				typedef __SIZE_TYPE__ size_t;
				#endif

				#if !defined(_CHAR16_T)
				#define _CHAR16_T
				typedef __CHAR16_TYPE__ char16_t;
				#endif

				#if !defined(_CHAR32_T)
				#define _CHAR32_T
				typedef __CHAR32_TYPE__ char32_t;
				#endif

				#if !defined(_MBSTATE_T)
				#define _MBSTATE_T
				typedef __WCHAR_TYPE__ mbstate_t;
				#endif

				%%include_file(${platform_uchar})

				%%public_api()

				#endif // LLVM_LIBC_UCHAR_H

libc/include/wchar.h.def

This file was added.

				//===---------------- C standard library header signal.h ------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_WCHAR_H
				#define LLVM_LIBC_WCHAR_H

				#include <__llvm-libc-common.h>

				#if !defined(_SIZE_T)
				#define _SIZE_T
				typedef __SIZE_TYPE__ size_t;
				#endif

				#if !defined(_WCHAR_T)
				#define _WCHAR_T
				typedef __WCHAR_TYPE__ wchar_t;
				#endif

				#if !defined(_WINT_T)
				#define _WINT_T
				typedef __WINT_TYPE__ wint_t;
				#endif

				#if !defined(_MBSTATE_T)
				#define _MBSTATE_T
				typedef __WCHAR_TYPE__ mbstate_t;
				#endif

				%%include_file(${platform_wchar})

				%%public_api()

				#endif // LLVM_LIBC_WCHAR_H

libc/src/CMakeLists.txt

	add_subdirectory(assert)			add_subdirectory(assert)
	add_subdirectory(errno)			add_subdirectory(errno)
	add_subdirectory(math)			add_subdirectory(math)
	add_subdirectory(signal)			add_subdirectory(signal)
	add_subdirectory(stdlib)			add_subdirectory(stdlib)
	add_subdirectory(string)			add_subdirectory(string)
	# TODO: Add this target conditional to the target OS.
	add_subdirectory(sys)			add_subdirectory(sys)
	add_subdirectory(threads)			add_subdirectory(threads)
				add_subdirectory(uchar)
				# TODO: Add this target conditional to the target OS.

	add_subdirectory(__support)			add_subdirectory(__support)

libc/src/uchar/CMakeLists.txt

This file was added.

				add_entrypoint_object(
				mbrtoc16
				SRCS
				mbrtoc16.cpp
				HDRS
				mbrtoc16.h
				#DEPENDS
				)

				add_entrypoint_object(
				c16rtomb
				SRCS
				c16rtomb.cpp
				HDRS
				c16rtomb.h
				#DEPENDS
				)

				add_entrypoint_object(
				mbrtoc32
				SRCS
				mbrtoc32.cpp
				HDRS
				mbrtoc32.h
				#DEPENDS
				)

				add_entrypoint_object(
				c32rtomb
				SRCS
				c32rtomb.cpp
				HDRS
				c32rtomb.h
				#DEPENDS
				)

libc/src/uchar/c16rtomb.h

This file was added.

				//===--------------- Implementation header for c16rtomb -----------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===--------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_UCHAR_C16RTOMB_H
				#define LLVM_LIBC_SRC_UCHAR_C16RTOMB_H

				#include "include/uchar.h"

				namespace __llvm_libc {

				size_t c16rtomb(char *restrict s, char16_t c16, mbstate_t *restrict ps);

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_UCHAR_C16RTOMB_H

libc/src/uchar/c16rtomb.cpp

This file was added.

				//===-------------------- Implementation of c16rtomb ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/uchar/c16rtomb.h"
				#include "../../include/wchar.h"

				#include "src/__support/common.h"

				namespace __llvm_libc {

				size_t LLVM_LIBC_ENTRYPOINT(c16rtomb)(char *restrict s, char16_t c16,
				                                      mbstate_t *restrict ps) {
				  if (c16 >= 0xD800 && c16 <= 0xDBFF) {
				    char16_t Value = c16 & 0x3FF;
				    if (Value < 0x7F) {
				      s = c16 & 0x7F;
				    } else {
				      if (ps == 0) {
				        s = 0xC0 | ((Value & 0x3F0) >> 4);
				        ps += 1;
				      } else if (ps == 1) {
				        s = 0x80 | (Value & 0xF);
				        ps = 0;
				      }
				    }
				  } else if (c16 >= 0xDC00 && c16 <= 0xDFFF) {
				    char16_t Value = c16 & 0x3FF;
				    if (Value < 0x7F) {
				      s = Value & 0x7F;
				    } else {
				      if (ps == 2) {
				        s = 0x80 | ((Value & 0x3F0) >> 4);
				        ps += 1;
				      } else if (ps == 3) {
				        s = 0x80 | (Value & 0xF);
				        ps = 0;
				      }
				    }
				  } else {
				    if (c16 < 0x7F) {
				      s = c16 & 0x7F;
				      ps += 1;
				    } else if (c16 < 0x7FF) {
				      if (ps == 0) {
				        s = 0xC0 | ((Value & 0x7C0) >> 6);
				        ps += 1;
				      } else if (ps == 1) {
				        s = 0x80 | (Value & 0x3F);
				        ps += 1;
				      }
				    } else if (c16 < 0xFFFF) {
				      if (ps == 0) {
				        s = 0xE0 | ((Value & 0xF000) >> 12);
				        ps += 1;
				      } else if (ps == 1) {
				        s = 0x80 | ((Value & 0xFC0) >> 6);
				        ps += 1;
				      } else if (ps == 2) {
				        s = 0x80 | (Value & 0x3F);
				        ps = 0;
				      }
				    }
				  }
				  return CodePointSizeInUTF8CodeUnits;
				}

				} // namespace __llvm_libc

libc/src/uchar/c32rtomb.h

This file was added.

				//===--------------- Implementation header for c32rtomb -----------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===--------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_UCHAR_C32RTOMB_H
				#define LLVM_LIBC_SRC_UCHAR_C32RTOMB_H

				#include "include/uchar.h"

				namespace __llvm_libc {

				size_t c16rtomb(char *restrict s, char16_t c16, mbstate_t *restrict ps);

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_UCHAR_C16RTOMB_H

libc/src/uchar/c32rtomb.cpp

This file was added.

				//===-------------------- Implementation of c32rtomb ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/uchar/c32rtomb.h"

				#include "src/__support/common.h"

				namespace __llvm_libc {

				size_t LLVM_LIBC_ENTRYPOINT(c32rtomb)(char *restrict s, char32_t c32,
				                                      mbstate_t *restrict ps) {
				  size_t StringSize = 0;
				  if (c32 <= 0x7F) {
				    StringSize = 1;
				    s[0] = c32 & 0x7F;
				  } else if (c32 <= 0x7FF) {
				    StringSize = 2;
				    s[0] = 0xC0 | (c32 & ((0x1F << 6) >> 6));
				    s[1] = 0x80 | (c32 & 0x3F);
				  } else if (c32 <= 0xFFFF) {
				    StringSize = 3;
				    s[0] = 0xE0 | (c32 & ((0x0F << 12) >> 12));
				    s[1] = 0x80 | (c32 & ((0x3F << 6) >> 6));
				    s[2] = 0x80 | (c32 & 0x3F);
				  } else if (c32 <= 0x10FFFF) {
				    StringSize = 4;
				    s[0] = 0xF0 | (c32 & 0x1C0000) >> 18;
				    s[1] = 0x80 | (c32 & 0x3F000) >> 12;
				    s[2] = 0x80 | (c32 & 0xFC0) >> 6;
				    s[3] = 0x80 | (c32 & 0x3F);
				  }
				  return StringSize;
				}

				} // namespace __llvm_libc

libc/src/uchar/mbrtoc16.h

This file was added.

				//===---------------- Implementation header for mbrtoc16 -----------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===---------------------------------------------------------------------===//
				#ifndef LLVM_LIBC_SRC_UCHAR_MBRTOC16_H
				#define LLVM_LIBC_SRC_UCHAR_MBRTOC16_H

				#include "include/uchar.h"

				namespace __llvm_libc {

				size_t mbrtoc16(char16_t *restrict pc16, const char *restrict s, size_t n,
				                mbstate_t *restrict ps);

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_UCHAR_MBRTOC16_H

libc/src/uchar/mbrtoc16.cpp

This file was added.

				//===-------------------- Implementation of mbrtoc16 ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/uchar/mbrtoc16.h"

				#include "src/__support/common.h"

				namespace __llvm_libc {

				size_t LLVM_LIBC_ENTRYPOINT(mbrtoc16)(char16_t *restrict pc16,
				                                      const char *restrict s, size_t n,
				                                      mbstate_t *restrict ps) {

				  // utf-8 to utf-16

				  // first check bit 1 and 1 and 2 to see if theyre leading or trailing bytes

				  if (ps == 0) {
				    if ((s & 0x80) == 0) {
				      // ASCII
				    } else if ((s & 0x80) == ) {

				    }
				  } else {

				  }

				  size_t StringSize = 0;
				  char32_t Decoded = 0;

				  switch (n) {
				  case 1:
				    Decoded = s[0] & 0x7F;
				    break;
				  case 2:
				    Decoded |= (s[0] & 0x1F) << 6;
				    Decoded |= (s[1] & 0x3F) << 0;
				    break;
				  case 3:
				    Decoded |= (s[0] & 0x0F) << 12;
				    Decoded |= (s[1] & 0x1F) << 6;
				    Decoded |= (s[2] & 0x1F) << 0;
				    break;
				  case 4:
				    Decoded |= (s[0] & 0x07) << 18;
				    Decoded |= (s[1] & 0x3F) << 12;
				    Decoded |= (s[2] & 0x3F) << 6;
				    Decoded |= (s[3] & 0x3F) << 0;
				    break;
				  }

				  if (Decoded <= 0xFFFF) {
				    StringSize = 1;
				    pc16[0] = Decoded & 0xFFFF;
				  } else if (Decoded <= 0x10FFFF) {
				    StringSize = 2;
				    pc16[0] = 0xD800 + ((Decoded & 0xFFC00) >> 10);
				    pc16[1] = 0xDC00 + (Decoded & 0x3FF);
				  }
				  return StringSize;
				}

				} // namespace __llvm_libc

libc/src/uchar/mbrtoc32.h

This file was added.

				//===---------------- Implementation header for mbrtoc32 -----------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===---------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_UCHAR_MBRTOC32_H
				#define LLVM_LIBC_SRC_UCHAR_MBRTOC32_H

				#include "include/uchar.h"

				namespace __llvm_libc {

				size_t mbrtoc32(char32_t *restrict pc32, const char *restrict s, size_t n,
				                mbstate_t *restrict ps);

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_UCHAR_MBRTOC32_H

libc/src/uchar/mbrtoc32.cpp

This file was added.

				//===-------------------- Implementation of mbrtoc32 ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/uchar/mbrtoc32.h"

				#include "src/__support/common.h"

				namespace __llvm_libc {

				size_t LLVM_LIBC_ENTRYPOINT(mbrtoc32)(char32_t *restrict pc32,
				                                      const char *restrict s, size_t n,
				                                      mbstate_t *restrict ps) {
				  size_t StringSize = 1;

				  switch (n) {
				  case 1:
				    pc32[0] = s[0] & 0x7F;
				    break;
				  case 2:
				    pc32[0] |= (s[0] & 0x1F) << 6;
				    pc32[0] |= (s[1] & 0x3F) << 0;
				    break;
				  case 3:
				    pc32[0] |= (s[0] & 0x0F) << 12;
				    pc32[0] |= (s[1] & 0x1F) << 6;
				    pc32[0] |= (s[2] & 0x1F) << 0;
				    break;
				  case 4:
				    pc32[0] |= (s[0] & 0x07) << 18;
				    pc32[0] |= (s[1] & 0x3F) << 12;
				    pc32[0] |= (s[2] & 0x3F) << 6;
				    pc32[0] |= (s[3] & 0x3F) << 0;
				    break;
				  }
				  return StringSize;
				}

				} // namespace __llvm_libc

libc/src/wchar/mbsinit.cpp

This file was added.

				//===-------------------- Implementation of mbrtoc16 ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "../../wchar.h"

				#include "src/__support/common.h"

				namespace __llvm_libc {

				int mbsinit(const mbstate_t *ps) {
				  return ps == nullptr ? 1 : 0;
				}

				} // namespace __llvm_libc

libc/test/src/CMakeLists.txt

	add_subdirectory(assert)			add_subdirectory(assert)
	add_subdirectory(errno)			add_subdirectory(errno)
	add_subdirectory(signal)			add_subdirectory(signal)
	add_subdirectory(stdlib)			add_subdirectory(stdlib)
	add_subdirectory(string)			add_subdirectory(string)
	add_subdirectory(sys)			add_subdirectory(sys)
	add_subdirectory(threads)			add_subdirectory(threads)
				add_subdirectory(uchar)
