This is an archive of the discontinued LLVM Phabricator instance.

ConvertUTF: convertUTF32ToUTF8String
Abandoned · Public

Authored by MarcusJohnson91 on Jul 30 2021, 4:19 PM.

Details

Reviewers
efriedma
Summary

New revision, since git am got mangled somehow.

Maybe it's the pre-commit hook I created to run clang-format-diff.py that's mangling everything?

Diff Detail

Event Timeline

MarcusJohnson91 requested review of this revision. Jul 30 2021, 4:19 PM
Herald added a project: Restricted Project. Jul 30 2021, 4:19 PM

As far as I can tell, the latest version of the diff you uploaded still has the following issues that haven't been addressed:

  1. The BOM handling is actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.
  2. SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

> As far as I can tell, the latest version of the diff you uploaded still has the following issues that haven't been addressed:
>
>   1. The BOM handling is actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.
>   2. SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

What BOM handling? There is no BOM function; bytes are swapped in the converter if the byte order isn't correct. Is that what you mean?

I copied SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 from the UTF-16 version.

Are you asking me to change the UTF-16 version too?

> What BOM handling? There is no BOM function; bytes are swapped in the converter if the byte order isn't correct. Is that what you mean?

I mean the behavior when handling strings that contain UNI_UTF32_BYTE_ORDER_MARK_SWAPPED.

I suspect a lot of places don't want the BOM handling to trigger. This includes trying to print diagnostics for wprintf, since the underlying function doesn't have any BOM handling. But I guess it's unlikely to matter in practice.
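To see the concern concretely, here is a minimal standalone sketch, using UTF-16 code units where the collision is most direct (in ConvertUTF.h the UTF-16 swapped-BOM constant 0xFFFE is bit-identical to the code point U+FFFE); maybeSwap below is a hypothetical stand-in for the auto-detection behavior, not code from the patch:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  using UTF16 = uint16_t;

  // In ConvertUTF.h, UNI_UTF16_BYTE_ORDER_MARK_SWAPPED is 0xFFFE, the same
  // bit pattern as the noncharacter U+FFFE, so ordinary data and a
  // byte-swapped BOM are indistinguishable.
  static const UTF16 SwappedBOM16 = 0xFFFE;

  // Hypothetical stand-in for the byteswap-on-swapped-BOM behavior.
  static void maybeSwap(UTF16 *Buf, size_t Len) {
    if (Len > 0 && Buf[0] == SwappedBOM16)
      std::transform(Buf, Buf + Len, Buf, [](UTF16 C) {
        return static_cast<UTF16>((C << 8) | (C >> 8));
      });
  }

  // A format string that merely starts with U+FFFE as data, e.g.
  //   UTF16 Fmt[] = {0xFFFE, 0x0025, 0x0073, 0}; // L"\uFFFE%s"
  // would have every code unit swapped by maybeSwap(), turning '%' (0x0025)
  // into 0x2500 and destroying the "%s" a wprintf diagnostic needs to inspect.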

> I copied SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 from the UTF-16 version.
>
> Are you asking me to change the UTF-16 version too?

I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.
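To make that expansion concrete, a small illustration (the variable names are just for exposition):

  // U+20AC (EURO SIGN) is one 2-byte code unit in UTF-16 but 3 bytes in
  // UTF-8, so UTF-16 -> UTF-8 conversion can grow the buffer by 3/2:
  const char16_t U16[] = u"\u20AC"; // 2 bytes of payload
  const char U8[] = "\xE2\x82\xAC"; // 3 bytes of payload
  // The same code point is 4 bytes in UTF-32, and no code point needs more
  // than 4 UTF-8 bytes, so UTF-32 -> UTF-8 conversion never grows the buffer:
  const char32_t U32[] = U"\u20AC"; // 4 bytes of payload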

> I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.

I've written my own Unicode encoder/decoder before; I'm familiar with how it works.

You can store regular ASCII in a UTF-32 string; "Example" as UTF-32 would be 7 * 4 = 28 bytes (not counting the null terminator), whereas it would be just 7 bytes in UTF-8.

It looks like the std::string is being compacted afterwards with Out.resize(reinterpret_cast<char *>(Dst) - &Out[0]);

but maybe a call to Out.shrink_to_fit() at the end is warranted?

The way the math is written now, for "Example", we allocate UNI_MAX_UTF8_BYTES_PER_CODE_POINT * sizeof(UTF32) * 7 = 112 bytes.

> The way the math is written now, for "Example", we allocate UNI_MAX_UTF8_BYTES_PER_CODE_POINT * sizeof(UTF32) * 7 = 112 bytes.

Alright, I'm gonna give it a try and re-run the tests.

Seems like the tests still work.
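For reference, a standalone sketch of the two size bounds discussed above; UNI_MAX_UTF8_BYTES_PER_CODE_POINT is 4 in ConvertUTF.h, and the two function names here are hypothetical:

  #include <cstddef>

  static const size_t UNI_MAX_UTF8_BYTES_PER_CODE_POINT = 4; // as in ConvertUTF.h

  // Bound copied from the UTF-16 path: treats every *byte* of input as if it
  // could become a full 4-byte UTF-8 sequence. For "Example" (7 code points,
  // 28 bytes of UTF-32), this reserves 4 * 4 * 7 = 112 bytes plus a terminator.
  size_t oversizedBound(size_t SrcByteCount) {
    return SrcByteCount * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1;
  }

  // Tight bound for UTF-32: each 4-byte code unit encodes one code point, and
  // a code point is at most 4 bytes in UTF-8, so the UTF-8 output can never be
  // longer than the UTF-32 input. For "Example" this reserves 28 + 1 = 29
  // bytes, which Out.resize() then trims to the 7 bytes actually written.
  size_t tightBound(size_t SrcByteCount) {
    return SrcByteCount + 1;
  }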

MarcusJohnson91 updated this revision to Diff 387786 (edited). Nov 16 2021, 4:35 PM

clang-format diff

OK, the tests passed. Can this be merged now?

MarcusJohnson91 abandoned this revision. Nov 21 2021, 6:15 PM

Reposted as D114342.