Page MenuHomePhabricator

[Support] Fix formatted_raw_ostream for UTF-8

Authored by ostannard on Mar 17 2020, 8:44 AM.


  • The getLine and getColumn functions need to update the position, or they will return stale data for buffered streams. This fixes a bug in the clang -analyzer-checker-option-help option, which was not wrapping the help text correctly when stdout is not a TTY.
  • If the stream contains multi-byte UTF-8 sequences, then the whole sequence needs to be considered to be a single character. This has the edge case that the buffer might fill up and be flushed part way through a character.
  • If the stream contains East Asian wide characters, these will be rendered twice as wide as other characters, so we need to increase the column count to match.

This doesn't attempt to handle everything unicode can do (combining characters, right-to-left markers, ...), but hopefully covers most things likely to be common in messages and source code we might want to print.

Diff Detail

Event Timeline

ostannard created this revision.Mar 17 2020, 8:44 AM
benlangmuir added inline comments.Mar 17 2020, 11:30 AM

The changes related to PartialUTF8Char LGTM, thanks!

hubert.reinterpretcast added inline comments.

s/UTF-8 character/UTF-8 code unit sequence for a Unicode scalar value/;


s/a UTF-8 character/the UTF-8 encoding of a Unicode scalar value/;




Should there be a test for combining characters?


s/chinese/Chinese/; or CJK.


Same comment re: CJK.


Same comment re: "Unicode".


s/code-point/code point/;


Typo: s/ane/and/;


s/UTF-8 code point/UTF-8 code unit sequence for a Unicode scalar value/;


Comma after "happens".







hoyFB added inline comments.Mar 17 2020, 10:41 PM

LGTM, thanks for fixing this!

ostannard marked 16 inline comments as done.Mar 18 2020, 3:59 AM
ostannard added inline comments.

This doesn't support combining characters, I don't think there's much point in adding a test for something which doesn't work.

ostannard updated this revision to Diff 251030.Mar 18 2020, 4:18 AM
  • I managed to reproduce the lsl-zero.s test failure on windows, this turned out to be because the test merges stdout and stderr, which can result in them being interleaved in ways which breaks tests. The test doesn't check anything in stderr, so fixed by redirecting that to /dev/null.
  • The unit test was failing when built with MSVC, fixed by explicitly using UTF-8 string literals.

The overflow-free version should be preferred: Size < NumBytes - PartialUTF8Char.size()

Considering that NumBytes - PartialUTF8Char.size() is already used in the else, this might as well be named NumBytesNeededFromBuffer first (at which point it would be the only use of NumBytes, so we can get rid of NumBytes):

size_t NumBytesNeededFromBuffer =
    getNumBytesForUTF8(PartialUTF8Char[0]) - PartialUTF8Char.size();
if (Size < NumBytesNeededFromBuffer) {

Prefer the overflow-free version: End - Ptr < NumBytes. If inclined to name End - Ptr (it would occur twice), BytesAvailable makes sense.


Got it; thanks.

ostannard updated this revision to Diff 251105.Mar 18 2020, 8:57 AM
ostannard marked 2 inline comments as done.

I have no further concerns with this patch (although I would suggest editing the description to use "Unicode" capitalized). I will leave approval of the patch to one of the requested reviewers.

This looks fine to me; I just have a number of nit picks.
The only part where I don't understand the code logic is around the comment starting with "If this is the final byte of a multi-byte sequence".


I think it would be useful to add a description to this comment that:
(a) This class assumes that a UTF-8 encoding is used on the Stream; and
(b) "This doesn't attempt to handle everything unicode can do (combining characters, right-to-left markers, ...), but hopefully covers most things likely to be common in messages and source code we might want to print."


<bikeshedding mode>
Given that ProcessCodePoint assumes that the Unicode code point represented in the UTF-8 encoding, maybe it be slightly better to name the lambda as ProcessUTF8CodePoint rather than ProcessCodePoint?
</bikeshedding mode>


I'm wondering if using sys::unicode::columnWidthUTF8 instead of sys::locale::columnWidth would be more future-proof and more clearly describe the intent that this function only processes UTF-8 and not strings encoded in other encodings?


The documentation for sys::unicode::columnWidthUTF8 documents it returns ErrorNonPrintableCharacter (-1) if the text contains non-printable characters.
Maybe it's more self-documenting to compare against ErrorNonPrintableCharacter rather than -1 in the above if condition?


Reading through the code linearly from the top to the bottom, I'm a bit surprised by this comment.
I would expect CP to contain exactly the bytes that when interpreted as a UTF-8 encoded Unicode character, form exactly one Unicode character.
Therefore, I'm not sure how to interpret "If this is the final byte of a multi-byte sequence.".
I'm expecting "this" to refer to "CP" in this context. But that cannot be "just" the final byte of a multi-byte sequence, unless my assumption that CP contains exactly the bytes forming a single UTF-8 encoded Unicode character is wrong.
Could CP contain a partial UTF-8 encoded character? If so, maybe it'd be better to change the name ProcessCodePoint to something that suggests that could be possible?


I guess "This" refers to \u2468? If so, it'd be easier to read this comment if it was written like: "// \u2468 encodes as three bytes, ..."

Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2020, 4:29 AM
ostannard updated this revision to Diff 275029.Jul 2 2020, 2:47 AM
ostannard marked 5 inline comments as done.
kristof.beyls accepted this revision.Jul 6 2020, 2:21 AM

LGTM, thanks!

This revision is now accepted and ready to land.Jul 6 2020, 2:21 AM
This revision was automatically updated to reflect the committed changes.