Page MenuHomePhabricator

[Clang] Add a warning on invalid UTF-8 in comments.
ClosedPublic

Authored by cor3ntin on Jun 17 2022, 7:34 AM.

Details

Summary

Introduce an off-by default -Winvalid-utf8 warning
that detects invalid UTF-8 code units sequences in comments.

Invalid UTF-8 in other places is already diagnosed,
as that cannot appear in identifiers and other grammar constructs.

The warning is off by default as its likely to be somewhat disruptive
otherwise.

This warning allows clang to conform to the yet-to be approved WG21
"P2295R5 Support for UTF-8 as a portable source file encoding"
paper.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
cor3ntin added inline comments.Jun 22 2022, 11:04 AM
clang/lib/Lex/Lexer.cpp
2406–2425

I don't think this applies as we are not inserting replacement characters, but I'm all for adding a comment.
I don't know if we should reference the c++ wording though, as I don;t think there is value in not applying that behavior in all language modes.

2719

This crashes when using _mm_load_si128 which suprises me because CurPtr is supposedly aligned on a 16 bytes boundary here. Any idea?

tahonermann added inline comments.Jun 22 2022, 11:54 AM
clang/lib/Lex/Lexer.cpp
2406–2425

Perhaps phrase it something like "diagnostics are issued consistent with the replacement character insertion strategy of Unicode PR-121 policy option 1". The goal is to make it clear that the diagnostic strategy is not out of thin air; that it follows Unicode endorsed behavior.

aaron.ballman added inline comments.Jun 22 2022, 12:06 PM
clang/lib/Lex/Lexer.cpp
2406–2425

Tom covered why I would prefer to see a comment, but as for the standards reference -- we already put in references to only one language for behavior we treat as an extension, so this would be more of the same. That ends up being helpful (especially if there's an explicit comment about it being an extension in other languages) because it tells the reader how the extension is roughly expected to behave as well.

2719

Wait, did you verify that CurPtr really is on a 16-byte boundary, or are you thinking it should be on such a boundary? (I don't see any alignment markings on the parameter, so I'd assume it's aligned as any other pointer.)

aaron.ballman added inline comments.Jun 22 2022, 12:31 PM
clang/lib/Lex/Lexer.cpp
2719

Derp, I missed that the loop above is manually aligning the pointer.

I'm not certain what's going on here with your crash...

cor3ntin added inline comments.Jun 22 2022, 1:01 PM
clang/lib/Lex/Lexer.cpp
2719

Just above, i do

while (C != '/' && ((intptr_t)CurPtr & 0x0F) != 0) {
        if (!isASCII(C)) {
          CurPtr--;
          break;
        }

Derp indeed (the CurPtr--; break is the culprit, fix shortly)

cor3ntin updated this revision to Diff 439177.Jun 22 2022, 3:17 PM
  • Add a comment referencing the (to be) C++ wording and unicode discussion on replacemet characters
  • Do not try to skip utf8 checks when the warning is not enabled as checking whether the warning is enabled is more expensive
  • Fix a bug causing vectorization to not be aligned
cor3ntin updated this revision to Diff 439276.Jun 23 2022, 1:46 AM
  • Remove explicit load instruction
  • cleanup extra braces

@aaron.ballman FYI this is now core-approved

erichkeane added inline comments.
clang/docs/ReleaseNotes.rst
276
cor3ntin updated this revision to Diff 441748.Jul 1 2022, 11:11 AM

Fix typo.

aaron.ballman added reviewers: tahonermann, Restricted Project.Jul 5 2022, 9:13 AM
aaron.ballman added inline comments.
clang/docs/ReleaseNotes.rst
277

Should we mention P2295R5 now that it's at least core approved? Something like:


...unit sequences in comments; in support of `P2295R5 <http://wg21.link/P2295R5>`_.

My plan is to do that for all papers once past plenary to
keep consistency with cxx_status.

aaron.ballman accepted this revision.Jul 6 2022, 8:25 AM

The changes here LGTM, thank you for this!

This revision is now accepted and ready to land.Jul 6 2022, 8:25 AM
This revision was landed with ongoing or failed builds.Jul 6 2022, 8:59 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jul 6 2022, 10:18 AM

As far as I can tell, this breaks check-clang everywhere: http://45.33.8.238/

Please take a look and revert for now if it takes a while to fix.

As far as I can tell, this breaks check-clang everywhere: http://45.33.8.238/

Please take a look and revert for now if it takes a while to fix.

Thanks for letting me know, i hadn't noticed the bots screaming at me. I did revert until i can investigate and fix.

cor3ntin reopened this revision.Jul 6 2022, 11:09 AM
This revision is now accepted and ready to land.Jul 6 2022, 11:09 AM
cor3ntin updated this revision to Diff 442642.Jul 6 2022, 11:10 AM

Fix the test of a newly landed commit

This revision was landed with ongoing or failed builds.Jul 6 2022, 12:18 PM
This revision was automatically updated to reflect the committed changes.

The bots found a further issue which i committed a fix for directly https://reviews.llvm.org/D129223 - I should have caught that sooner

cor3ntin reopened this revision.Jul 6 2022, 3:17 PM
This revision is now accepted and ready to land.Jul 6 2022, 3:17 PM
cor3ntin updated this revision to Diff 442702.Jul 6 2022, 3:26 PM

Deploying that turned out to reveal a few critical issues

  • getUTF8SequenceSize never reported a non-zero length for valid

UTF-8 sequences.

  • In *some* circumstances (depending on the size of comment),

Unicode codepoints were parsed from one past their start,
because the CurPtr was sometimes, but not always, moved back.

I also added a test file with *valid* utf-8 in comments
(which would have caught these issues).

@aaron.ballman Ok, this didn't turn out great, and given the changes + the fact it was completely broken (twice!), I'd love if you could take a look at it again. Thanks :)

cor3ntin updated this revision to Diff 443222.Jul 8 2022, 5:40 AM

Fix bound check in the non vectorized case

aaron.ballman accepted this revision.Jul 8 2022, 5:42 AM

I only spotted one thing I think is actually an issue, the rest is style related. LGTM with the one issue fixed.

clang/lib/Lex/Lexer.cpp
2707–2709
2754–2769

< instead of <=?

2756–2758
2764–2766
cor3ntin updated this revision to Diff 443225.Jul 8 2022, 5:47 AM
cor3ntin marked 19 inline comments as done.

Remove superfuous braces.

This revision was landed with ongoing or failed builds.Jul 9 2022, 2:26 AM
This revision was automatically updated to reflect the committed changes.

@aaron.ballman Thanks for the review. I landed the changes and got a bunch of bots screaming at me for changes that are completely unrelated

https://lab.llvm.org/buildbot/#/builders/21/builds/45146
https://lab.llvm.org/buildbot/#/builders/36/builds/22838
https://lab.llvm.org/buildbot/#/builders/121/builds/21183

What do you mean it is completely unrelated? You have __ALTIVEC__-specific changes and it knocks out the PPC bots.

@aaron.ballman Thanks for the review. I landed the changes and got a bunch of bots screaming at me for changes that are completely unrelated

https://lab.llvm.org/buildbot/#/builders/21/builds/45146
https://lab.llvm.org/buildbot/#/builders/36/builds/22838
https://lab.llvm.org/buildbot/#/builders/121/builds/21183

What do you mean it is completely unrelated? You have __ALTIVEC__-specific changes and it knocks out the PPC bots.

Good point. The error was a bit misleading but i guess what's happening is a segfault when running clang-ast-dump.
I'm reverting for now and I don't really know how to go from there as I don't have the hardware to test that code path.

Good point. The error was a bit misleading but i guess what's happening is a segfault when running clang-ast-dump.
I'm reverting for now and I don't really know how to go from there as I don't have the hardware to test that code path.

Sending you info via e-mail on how to request access to systems.

cor3ntin reopened this revision.Jul 9 2022, 11:41 AM
This revision is now accepted and ready to land.Jul 9 2022, 11:41 AM
cor3ntin updated this revision to Diff 443699.Jul 11 2022, 11:32 AM

Fix crash on PowerPC

(I forgot to removed a non-sense line that
could cause the CurPtr to move incorrectly
past a / in the ALTIVEC code path).

aaron.ballman accepted this revision.Jul 12 2022, 5:16 AM

Fix crash on PowerPC

(I forgot to removed a non-sense line that
could cause the CurPtr to move incorrectly
past a / in the ALTIVEC code path).

Ah, nice catch -- LGTM!

This revision was landed with ongoing or failed builds.Jul 12 2022, 5:34 AM
This revision was automatically updated to reflect the committed changes.

I had to revert this because it breaks a bunch of LLDB tests: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/#showFailuresLink. It looks like this causes an error when building some SDK modules.

cor3ntin reopened this revision.Jul 12 2022, 5:14 PM
This revision is now accepted and ready to land.Jul 12 2022, 5:14 PM

I had to revert this because it breaks a bunch of LLDB tests: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/#showFailuresLink. It looks like this causes an error when building some SDK modules.

Thanks (and sorry for the annoyance)

cor3ntin updated this revision to Diff 444118.Jul 12 2022, 5:29 PM

This turned out to be an interesting bug.
The SSE code tried to be clever and skip over valid ascii code units when finding invalid UTF-8.
In doing so, it could run over the end of a comment entirely if

  • there was a short ascii comment
  • followed by a tiny amount of C++
  • followed by another comment containing non-ascii data.

It does not matter whether it was valid or not (which was misleading
as the file that tripped the bot is full of invalid code units.

The problematic test boils down to

enum a {
    x  /* 01234567890ABCDEF*/
};
/*ααααααααα*/

The fix is to do in the SSE codepath what we do in the altivec
and default paths: if we find an invalid code unit,
we rescan that bit of the comment on the slow path
without trying to update CurPtr (and for each code unit,
checking both for isASCIII an != '/' at the same time.

I had to revert this because it breaks a bunch of LLDB tests: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/#showFailuresLink. It looks like this causes an error when building some SDK modules.

Thanks (and sorry for the annoyance)

No worries at all. Thanks for looking into the issue!

This revision was landed with ongoing or failed builds.Jul 13 2022, 1:19 AM
This revision was automatically updated to reflect the committed changes.