This is an archive of the discontinued LLVM Phabricator instance.

ConvertUTF: Created wrapper convertUTF32ToUTF8String
AbandonedPublic

Authored by MarcusJohnson91 on Jul 24 2021, 11:48 AM.

Details

Summary

Split from D103426

The other patches rely on this, so the faster this is accepted the better

Diff Detail

Event Timeline

MarcusJohnson91 requested review of this revision.Jul 24 2021, 11:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 24 2021, 11:48 AM
MarcusJohnson91 updated this revision to Diff 361487.EditedJul 24 2021, 8:39 PM
MarcusJohnson91 edited the summary of this revision. (Show Details)

Added more tests

I don't understand why the build failed?

I've compiled it and ran all the tests with time ninja check

Please fix clang-format issues. (You can use the clang-format-diff.py script.)

llvm/lib/Support/ConvertUTFWrapper.cpp
155

Wrong constant.

169

Wrong constant.

Is this really the function you want to be using from clang? I don't really understand why you'd want to handle byte order marks.

204

Wrong function?

MarcusJohnson91 marked 3 inline comments as done.Jul 25 2021, 1:31 PM
MarcusJohnson91 added inline comments.
llvm/lib/Support/ConvertUTFWrapper.cpp
169

I don't really care about the BOM tbh, I just figured if I was in here, I should flesh out the UTF-32 interface.

MarcusJohnson91 marked an inline comment as done.

Implemented the fixes mentioned and reformatted the patch

Anyone got any ideas what happened this time?

The buildbot seems to think your new unittests are broken. Not sure why. (You can run just the unittests with "ninja check-llvm-unit".)

llvm/include/llvm/Support/ConvertUTF.h
127

UNI_UTF32_BYTE_ORDER_MARK_SWAPPED doesn't have the correct bit pattern.

llvm/lib/Support/ConvertUTFWrapper.cpp
169

The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.

Dropped the UTF32 BOM stuff

efriedma added inline comments.Jul 27 2021, 4:12 PM
llvm/lib/Support/ConvertUTFWrapper.cpp
183

SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

llvm/lib/Support/ConvertUTFWrapper.cpp
183

I copied that from the UTF16 code

Updated the tests

The problem seems to be in the conversion function expecting strings to be a multiple of 4 bytes, which doesn't hold up with the way ArrayRef stores things as char that is casted to char32_t, when using ASCII values like in the look of disapproval emoji, having an underscore in the middle.

But removing the assert and early return result in even more errors.

changing the input string to remove the underscore also fails, i'm out of ideas.

The tests work on my machine now, turns out the Big endian one needs a BOM, pretty obvious in hindsight.

Formatted the diff

efriedma added inline comments.Jul 30 2021, 2:20 PM
llvm/lib/Support/ConvertUTFWrapper.cpp
169

Any thoughts on this?

183

I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.

MarcusJohnson91 updated this revision to Diff 363216.EditedJul 30 2021, 3:06 PM

It seems like this diff keeps getting reverted?

I've fixed all the issues mentioned, and the tests work now, everything is formatted correctly too.

I've set git up to do full context diffs, but it's not working?

@efriedma it seems like you are commenting on old revisions?

the first comment about the UTF16 BOM in the UTF32 converter was fixed a long time ago and the second comment, line 168 I don't even see what you're talking about there

As far as I can tell, the lastest version of the diff you uploaded still has the following issues that haven't been addressed:

  1. The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.
  2. SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

There is only one function in ConvertUTFWrapper.cpp: convertUTF32ToUTF8String

idk wtf is going on, maybe the ammending the commit is breaking something?

the diff I see here is correct...

Maybe I should just make a new diff here entirely?

MarcusJohnson91 abandoned this revision.Jul 30 2021, 5:56 PM

If you're having trouble making Arcanist work correctly, you can always just upload "git diff" or "git show" output at https://reviews.llvm.org/differential/diff/create/ .