This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Log rather than assert on bad UTF-8.
ClosedPublic

Authored by sammccall on Jun 9 2020, 9:41 PM.

Details

Summary

I don't love this change, but it prevents crashing when indexing boost headers,
and I can't think of a better practical alternative.

Based on a patch by AnakinZheng!

Diff Detail

Event Timeline

sammccall created this revision.Jun 9 2020, 9:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 9 2020, 9:41 PM

@jaafar would you be able to verify this is equivalent to D74731 for your purposes?
If so I'd like to cherrypick this into tho 10.x branch after landing.

kadircet accepted this revision.Jun 10 2020, 1:51 AM

LGTM, thanks! (i bet you rushed it already for the cherry-pick, but just wanted to remind again that we should :D)

This revision is now accepted and ready to land.Jun 10 2020, 1:51 AM
This revision was automatically updated to reflect the committed changes.

I can confirm that this commit works equally well for the UTF-8 assertion failure. Thank you!

Is there some way to easily modify the source to produce a diagnostic pointing to the issue's source location? I would like to file an appropriate bug in Boost.

I can confirm that this commit works equally well for the UTF-8 assertion failure. Thank you!

Is there some way to easily modify the source to produce a diagnostic pointing to the issue's source location? I would like to file an appropriate bug in Boost.

In https://www.boost.org/doc/libs/1_73_0/boost/spirit/home/support/char_encoding/iso8859_1.hpp
The line labeled 161 a1, whose bytes are "/* \xa1 161 a1 */ BOOST_CC_PUNCT," triggers it because of the "\xa1".

It's not a cut and dried bug, but I do think removing the high bytes from these comments is a good idea, so it's probably worth filing the bug.

The C++ standard doesn't say anything about the encoding of characters on disk (the "input encoding" of bytes -> source character set) - it starts with the source character set, I think. So what do implementations do?

  • clang: from reading the code, clang only supports UTF-8. It supports gcc's -finput-charset flag, but setting it to anything other than UTF-8 is an error!
  • GCC: supports a variety of input encodings, configured with -finput-charset or locale. As such it's not really reasonable for a header designed to be included to require any particular value, as you can't pass a flag for just that header.

In practice this doesn't actually affect compilation because the bad UTF-8 sequence in the comment is never parsed: clang just skips over it looking for */. It probably mostly affects tools that use line/column coordinates (like clangd by virtue of LSP, and clang diagnostics), and tools that extract comment contents (doxygen et al).
But it still seems that it would be clearer to remove the literal characters from the source file, or write them in UTF-8.

Thanks for the detailed analysis! I have filed https://github.com/boostorg/spirit/issues/612