This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Add a warning on invalid UTF-8 in comments.
ClosedPublic

Authored by cor3ntin on Jun 17 2022, 7:34 AM.

Details

Summary

Introduce an off-by default -Winvalid-utf8 warning
that detects invalid UTF-8 code units sequences in comments.

Invalid UTF-8 in other places is already diagnosed,
as that cannot appear in identifiers and other grammar constructs.

The warning is off by default as its likely to be somewhat disruptive
otherwise.

This warning allows clang to conform to the yet-to be approved WG21
"P2295R5 Support for UTF-8 as a portable source file encoding"
paper.

Diff Detail

Event Timeline

cor3ntin created this revision.Jun 17 2022, 7:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2022, 7:34 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
cor3ntin requested review of this revision.Jun 17 2022, 7:34 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 17 2022, 7:34 AM
aaron.ballman added inline comments.Jun 21 2022, 6:57 AM
clang/include/clang/Basic/DiagnosticLexKinds.td
116–117

What would it take to enable the diagnostic by default? We don't typically add off-by-default diagnostics because there's evidence that not many folks enable them.

Alternatively, would this make sense as a pedantic warning? It's not really an extension for us to accept this stuff, but it seems like we can still pedantically diagnose the code?

clang/lib/Lex/Lexer.cpp
2400

It looks like this can move into the while loop and we can remove some = false from within the loop?

2408

Do you have some idea of performance cost for enabling the warning with a large TU? Are we talking .5%? 1%? 5%? noise?

2416

So if unicode decoding didn't fail... diagnose? :-) The naming here seems unfortunate.

2416–2418
2696

I think this can move into the loop as well.

2806–2808

Same concerns about naming here as above.

llvm/lib/Support/ConvertUTF.cpp
426–429
cor3ntin updated this revision to Diff 438926.Jun 21 2022, 11:44 PM
cor3ntin marked 3 inline comments as done.
  • Address style comments
  • Improve commit message
  • Enable the warning in -pedantic
cor3ntin added inline comments.Jun 21 2022, 11:49 PM
clang/lib/Lex/Lexer.cpp
2400

The idea here is to not diagnose a contiguous sequence of invalid code units more than once
The sequence 0x80x0x80x80 for example will only warn one for example. The goal is to avoid noise when, for example there is a long comment in Windows-1251 encoded cyrillic. it would warn one per word (because the spaces would decode fine) rather than one per contiguous non-sense character.
It's also somewhat necessary because we can't know where the end of the invalid "character" is, better to be greedy.
(The test file check that behavior)

2408

This adds 2 comparisons when the warning is enbled (is ASCII(c) is simply C < 0x80) - so it should be noise, but I have not run benchmark.
It might be more noticeable within multi line comments as there the optimization that vectorizes the comment skipping on SSE2 platforms is simply skipped when this optimization is enabled.

2416

if(!UnicodeDecodeFailedInThePreviousLoopIteration) ? I'm open to suggestion here

aaron.ballman added inline comments.Jun 22 2022, 4:36 AM
clang/lib/Lex/Lexer.cpp
2400

Oh derp, you're right (of course), sorry for the noise!

2408

Ah, no need to run the benchmark then, I was thinking this would involve function call overhead and more complex checking for the isASCII call than exists in reality.

2416

UnicodeDecodingAlreadyDiagnosed

2701–2702

However, *this* seems like it could noticeably impact performance, especially for C code (including C header files in C++ mode), and may be worth getting measurements for.

cor3ntin updated this revision to Diff 438984.Jun 22 2022, 5:18 AM

s/UnicodeDecodeFailed/UnicodeDecodingAlreadyDiagnosed/

cor3ntin updated this revision to Diff 438990.Jun 22 2022, 5:37 AM

Make sure the warning is off by default.

tahonermann added inline comments.
clang/include/clang/Basic/DiagnosticLexKinds.td
116–117

I agree with enabling this by default in at least some warning mode. Aaron's suggestion of making it a pedantic warning seems reasonable to me.

clang/lib/Lex/Lexer.cpp
2406–2425

I think it would be helpful to include a link to Unicode PR issue 121 here and note that the diagnostic follows policy option 1. Likewise for handling of '//' comment styles below. Alternatively, provide the link and a note in the release notes.

cor3ntin updated this revision to Diff 439047.Jun 22 2022, 9:09 AM

Vectorizes comment skipping even when -Winvalid-utf8 is enabled.

cor3ntin updated this revision to Diff 439057.Jun 22 2022, 9:24 AM

Further optimize

cor3ntin added a comment.EditedJun 22 2022, 9:32 AM

@aaron.ballman

I created a 1.3GB file containing 500000 comments of random size and content,
with the following script

import string
import random

for i in range(0, 500000):
    print("/*{}*/".format(''.join(random.choices('\n' + string.ascii_uppercase + string.digits,
                            k=random.randrange(400, 5000)))
    ))
print("int main() {}")

The results are as follow:

Benchmark 1: /home/cor3ntin/dev/compilers/LLVM/build_release_main/bin/clang -Isqlite huge.cpp
  Time (mean ± σ):     221.3 ms ±   4.0 ms    [User: 151.5 ms, System: 69.7 ms]
  Range (min … max):   217.2 ms … 229.6 ms    13 runs
 
Benchmark 2: bin/clang -Isqlite huge.cpp
  Time (mean ± σ):     212.7 ms ±   7.6 ms    [User: 155.2 ms, System: 57.2 ms]
  Range (min … max):   204.6 ms … 226.9 ms    14 runs
 
Benchmark 3: ./bin/clang -Winvalid-utf8 -Isqlite huge.cpp
  Time (mean ± σ):     210.9 ms ±   4.0 ms    [User: 151.8 ms, System: 58.9 ms]
  Range (min … max):   205.7 ms … 220.2 ms    14 runs
 
Summary
  './bin/clang -Winvalid-utf8 -Isqlite huge.cpp' ran
    1.01 ± 0.04 times faster than 'bin/clang -Isqlite huge.cpp'
    1.05 ± 0.03 times faster than '/home/cor3ntin/dev/compilers/LLVM/build_release_main/bin/clang -Isqlite huge.cpp

I also tried sqlite (8MB file with all the sources)

Benchmark 1: /home/cor3ntin/dev/compilers/LLVM/build_release_main/bin/clang -Isqlite sqlite/sqlite3.c
  Time (mean ± σ):      1.852 s ±  0.033 s    [User: 1.782 s, System: 0.069 s]
  Range (min … max):    1.806 s …  1.904 s    10 runs
 
  Warning: Ignoring non-zero exit code.
 
Benchmark 2: bin/clang -Isqlite sqlite/sqlite3.c
  Time (mean ± σ):      1.758 s ±  0.010 s    [User: 1.694 s, System: 0.061 s]
  Range (min … max):    1.738 s …  1.774 s    10 runs
 
  Warning: Ignoring non-zero exit code.
 
Benchmark 3: ./bin/clang -Winvalid-utf8 -Isqlite sqlite/sqlite3.c
  Time (mean ± σ):      1.756 s ±  0.021 s    [User: 1.700 s, System: 0.053 s]
  Range (min … max):    1.741 s …  1.810 s    10 runs
 
  Warning: Ignoring non-zero exit code.
 
Summary
  './bin/clang -Winvalid-utf8 -Isqlite sqlite/sqlite3.c' ran
    1.00 ± 0.01 times faster than 'bin/clang -Isqlite sqlite/sqlite3.c'
    1.05 ± 0.02 times faster than '/home/cor3ntin/dev/compilers/LLVM/build_release_main/bin/clang -Isqlite sqlite/sqlite3.c'

and a file including all the c++ standard headers

Benchmark 1: /home/cor3ntin/dev/compilers/LLVM/build_release_main/bin/clang -Isqlite headers.cpp
  Time (mean ± σ):     956.6 ms ±  12.9 ms    [User: 891.1 ms, System: 63.0 ms]
  Range (min … max):   935.8 ms … 979.5 ms    10 runs
 
  Warning: Ignoring non-zero exit code.
 
Benchmark 2: bin/clang -Isqlite headers.cpp
  Time (mean ± σ):     906.1 ms ±  19.4 ms    [User: 860.1 ms, System: 43.6 ms]
  Range (min … max):   883.8 ms … 941.3 ms    10 runs
 
  Warning: Ignoring non-zero exit code.
  Warning: The first benchmarking run for this command was significantly slower than the rest (941.3 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
 
Benchmark 3: ./bin/clang -Winvalid-utf8 -Isqlite headers.cpp
  Time (mean ± σ):     906.8 ms ±  16.1 ms    [User: 866.5 ms, System: 39.5 ms]
  Range (min … max):   887.6 ms … 940.6 ms    10 runs
 
  Warning: Ignoring non-zero exit code.
 
Summary
  'bin/clang -Isqlite headers.cpp' ran
    1.00 ± 0.03 times faster than './bin/clang -Winvalid-utf8 -Isqlite headers.cpp'
    1.06 ± 0.03 times faster than '/home/cor3ntin/dev/compilers/LLVM/build_release_main/bin/clang -Isqlite headers.cpp'

You will notice i no longer check for whether the diagnostic is enabled preemptively as this turned out to be have negative effects on performance.

I posted bogus results earlier because i was using a pre built clang as the baseline and i suppose it's more optimized than the default build options, or was compiled with a different compiler

It looks like precommit CI caught some issues:

Failed Tests (2):
  Clangd :: utf8.test
  Clangd Unit Tests :: ./ClangdTests.exe/CompletionStringTest/GetDeclCommentBadUTF8
clang/include/clang/Basic/DiagnosticLexKinds.td
116–117

Great, now that it's marked Extension it will show up when someone enables -pedantic (but is otherwise off by default).

clang/lib/Lex/Lexer.cpp
2406–2425

+1 -- I hadn't realized there were other options. However, ultimately I think this should be committed when WG21 has adopted the paper and then it would be even better to have a comment citing which stable name and paragraph number the code is implementing from P2295.

cor3ntin added inline comments.Jun 22 2022, 11:04 AM
clang/lib/Lex/Lexer.cpp
2406–2425

I don't think this applies as we are not inserting replacement characters, but I'm all for adding a comment.
I don't know if we should reference the c++ wording though, as I don;t think there is value in not applying that behavior in all language modes.

2718

This crashes when using _mm_load_si128 which suprises me because CurPtr is supposedly aligned on a 16 bytes boundary here. Any idea?

tahonermann added inline comments.Jun 22 2022, 11:54 AM
clang/lib/Lex/Lexer.cpp
2406–2425

Perhaps phrase it something like "diagnostics are issued consistent with the replacement character insertion strategy of Unicode PR-121 policy option 1". The goal is to make it clear that the diagnostic strategy is not out of thin air; that it follows Unicode endorsed behavior.

aaron.ballman added inline comments.Jun 22 2022, 12:06 PM
clang/lib/Lex/Lexer.cpp
2406–2425

Tom covered why I would prefer to see a comment, but as for the standards reference -- we already put in references to only one language for behavior we treat as an extension, so this would be more of the same. That ends up being helpful (especially if there's an explicit comment about it being an extension in other languages) because it tells the reader how the extension is roughly expected to behave as well.

2718

Wait, did you verify that CurPtr really is on a 16-byte boundary, or are you thinking it should be on such a boundary? (I don't see any alignment markings on the parameter, so I'd assume it's aligned as any other pointer.)

aaron.ballman added inline comments.Jun 22 2022, 12:31 PM
clang/lib/Lex/Lexer.cpp
2718

Derp, I missed that the loop above is manually aligning the pointer.

I'm not certain what's going on here with your crash...

cor3ntin added inline comments.Jun 22 2022, 1:01 PM
clang/lib/Lex/Lexer.cpp
2718

Just above, i do

while (C != '/' && ((intptr_t)CurPtr & 0x0F) != 0) {
        if (!isASCII(C)) {
          CurPtr--;
          break;
        }

Derp indeed (the CurPtr--; break is the culprit, fix shortly)

cor3ntin updated this revision to Diff 439177.Jun 22 2022, 3:17 PM
  • Add a comment referencing the (to be) C++ wording and unicode discussion on replacemet characters
  • Do not try to skip utf8 checks when the warning is not enabled as checking whether the warning is enabled is more expensive
  • Fix a bug causing vectorization to not be aligned
cor3ntin updated this revision to Diff 439276.Jun 23 2022, 1:46 AM
  • Remove explicit load instruction
  • cleanup extra braces

@aaron.ballman FYI this is now core-approved

erichkeane added inline comments.
clang/docs/ReleaseNotes.rst
281
cor3ntin updated this revision to Diff 441748.Jul 1 2022, 11:11 AM

Fix typo.

aaron.ballman added reviewers: tahonermann, Restricted Project.Jul 5 2022, 9:13 AM
aaron.ballman added inline comments.
clang/docs/ReleaseNotes.rst
282

Should we mention P2295R5 now that it's at least core approved? Something like:


...unit sequences in comments; in support of `P2295R5 <http://wg21.link/P2295R5>`_.

My plan is to do that for all papers once past plenary to
keep consistency with cxx_status.

aaron.ballman accepted this revision.Jul 6 2022, 8:25 AM

The changes here LGTM, thank you for this!

This revision is now accepted and ready to land.Jul 6 2022, 8:25 AM
This revision was landed with ongoing or failed builds.Jul 6 2022, 8:59 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jul 6 2022, 10:18 AM

As far as I can tell, this breaks check-clang everywhere: http://45.33.8.238/

Please take a look and revert for now if it takes a while to fix.

As far as I can tell, this breaks check-clang everywhere: http://45.33.8.238/

Please take a look and revert for now if it takes a while to fix.

Thanks for letting me know, i hadn't noticed the bots screaming at me. I did revert until i can investigate and fix.

cor3ntin reopened this revision.Jul 6 2022, 11:09 AM
This revision is now accepted and ready to land.Jul 6 2022, 11:09 AM
cor3ntin updated this revision to Diff 442642.Jul 6 2022, 11:10 AM

Fix the test of a newly landed commit

This revision was landed with ongoing or failed builds.Jul 6 2022, 12:18 PM
This revision was automatically updated to reflect the committed changes.

The bots found a further issue which i committed a fix for directly https://reviews.llvm.org/D129223 - I should have caught that sooner

cor3ntin reopened this revision.Jul 6 2022, 3:17 PM
This revision is now accepted and ready to land.Jul 6 2022, 3:17 PM
cor3ntin updated this revision to Diff 442702.Jul 6 2022, 3:26 PM

Deploying that turned out to reveal a few critical issues

  • getUTF8SequenceSize never reported a non-zero length for valid

UTF-8 sequences.

  • In *some* circumstances (depending on the size of comment),

Unicode codepoints were parsed from one past their start,
because the CurPtr was sometimes, but not always, moved back.

I also added a test file with *valid* utf-8 in comments
(which would have caught these issues).

@aaron.ballman Ok, this didn't turn out great, and given the changes + the fact it was completely broken (twice!), I'd love if you could take a look at it again. Thanks :)

cor3ntin updated this revision to Diff 443222.Jul 8 2022, 5:40 AM

Fix bound check in the non vectorized case

aaron.ballman accepted this revision.Jul 8 2022, 5:42 AM

I only spotted one thing I think is actually an issue, the rest is style related. LGTM with the one issue fixed.

clang/lib/Lex/Lexer.cpp
2707–2709
2751–2764

< instead of <=?

2753–2755
2761–2763
cor3ntin updated this revision to Diff 443225.Jul 8 2022, 5:47 AM
cor3ntin marked 19 inline comments as done.

Remove superfuous braces.

This revision was landed with ongoing or failed builds.Jul 9 2022, 2:26 AM
This revision was automatically updated to reflect the committed changes.

@aaron.ballman Thanks for the review. I landed the changes and got a bunch of bots screaming at me for changes that are completely unrelated

https://lab.llvm.org/buildbot/#/builders/21/builds/45146
https://lab.llvm.org/buildbot/#/builders/36/builds/22838
https://lab.llvm.org/buildbot/#/builders/121/builds/21183

What do you mean it is completely unrelated? You have __ALTIVEC__-specific changes and it knocks out the PPC bots.

@aaron.ballman Thanks for the review. I landed the changes and got a bunch of bots screaming at me for changes that are completely unrelated

https://lab.llvm.org/buildbot/#/builders/21/builds/45146
https://lab.llvm.org/buildbot/#/builders/36/builds/22838
https://lab.llvm.org/buildbot/#/builders/121/builds/21183

What do you mean it is completely unrelated? You have __ALTIVEC__-specific changes and it knocks out the PPC bots.

Good point. The error was a bit misleading but i guess what's happening is a segfault when running clang-ast-dump.
I'm reverting for now and I don't really know how to go from there as I don't have the hardware to test that code path.

Good point. The error was a bit misleading but i guess what's happening is a segfault when running clang-ast-dump.
I'm reverting for now and I don't really know how to go from there as I don't have the hardware to test that code path.

Sending you info via e-mail on how to request access to systems.

cor3ntin reopened this revision.Jul 9 2022, 11:41 AM
This revision is now accepted and ready to land.Jul 9 2022, 11:41 AM
cor3ntin updated this revision to Diff 443699.Jul 11 2022, 11:32 AM

Fix crash on PowerPC

(I forgot to removed a non-sense line that
could cause the CurPtr to move incorrectly
past a / in the ALTIVEC code path).

aaron.ballman accepted this revision.Jul 12 2022, 5:16 AM

Fix crash on PowerPC

(I forgot to removed a non-sense line that
could cause the CurPtr to move incorrectly
past a / in the ALTIVEC code path).

Ah, nice catch -- LGTM!

This revision was landed with ongoing or failed builds.Jul 12 2022, 5:34 AM
This revision was automatically updated to reflect the committed changes.

I had to revert this because it breaks a bunch of LLDB tests: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/#showFailuresLink. It looks like this causes an error when building some SDK modules.

cor3ntin reopened this revision.Jul 12 2022, 5:14 PM
This revision is now accepted and ready to land.Jul 12 2022, 5:14 PM

I had to revert this because it breaks a bunch of LLDB tests: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/#showFailuresLink. It looks like this causes an error when building some SDK modules.

Thanks (and sorry for the annoyance)

cor3ntin updated this revision to Diff 444118.Jul 12 2022, 5:29 PM

This turned out to be an interesting bug.
The SSE code tried to be clever and skip over valid ascii code units when finding invalid UTF-8.
In doing so, it could run over the end of a comment entirely if

  • there was a short ascii comment
  • followed by a tiny amount of C++
  • followed by another comment containing non-ascii data.

It does not matter whether it was valid or not (which was misleading
as the file that tripped the bot is full of invalid code units.

The problematic test boils down to

enum a {
    x  /* 01234567890ABCDEF*/
};
/*ααααααααα*/

The fix is to do in the SSE codepath what we do in the altivec
and default paths: if we find an invalid code unit,
we rescan that bit of the comment on the slow path
without trying to update CurPtr (and for each code unit,
checking both for isASCIII an != '/' at the same time.

I had to revert this because it breaks a bunch of LLDB tests: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/#showFailuresLink. It looks like this causes an error when building some SDK modules.

Thanks (and sorry for the annoyance)

No worries at all. Thanks for looking into the issue!

This revision was landed with ongoing or failed builds.Jul 13 2022, 1:19 AM
This revision was automatically updated to reflect the committed changes.

@aaron.ballman Thanks for the review. I landed the changes and got a bunch of bots screaming at me for changes that are completely unrelated

https://lab.llvm.org/buildbot/#/builders/21/builds/45146
https://lab.llvm.org/buildbot/#/builders/36/builds/22838
https://lab.llvm.org/buildbot/#/builders/121/builds/21183 free games

Good point. The error was confusing, but it's probably a segfault while running.

@cor3ntin, sorry for failing to keep up with reviews; I know this has already been committed. I did spot a couple of typos should you feel inclined to address them.

clang/lib/Lex/Lexer.cpp
2399
2695

I'll be unavailable the next 2 weeks, feel free to do it if you want!

Fako432 added a subscriber: Fako432.Jul 9 2023, 4:44 PM

What do you mean it is completely unrelated? You have __ALTIVEC__-specific changes and it knocks out the PPC bots.

Good point. The error was a bit misleading but i guess what's happening is a segfault when running clang-ast-dump.
I'm reverting for now and I don't really know how to go from there as I don't have the hardware to test that code path.

You raise a valid argument. The confusion caused by the error could potentially be attributed to a segmentation fault occurring during the execution.