This is an archive of the discontinued LLVM Phabricator instance.

[clang-tidy][modernize-raw-string-literal] Don't replace upper ASCII with raw literals
ClosedPublic

Authored by zinovy.nis on Apr 22 2018, 1:36 PM.

Details

Summary

It's useless and not safe to replace UTF-8 encoded with escaped ASCII to raw UTF-8 chars:

"\xE2\x98\x83" ---> "☃" (snowman)

Especially there's an explicit test for ASCII in this check.

Diff Detail

Event Timeline

zinovy.nis created this revision.Apr 22 2018, 1:36 PM
zinovy.nis added inline comments.Apr 22 2018, 1:38 PM
test/clang-tidy/modernize-raw-string-literal.cpp
42

By the way, AFAIK the lines above are not checked for fixes (and fixes absence) at all! Is it correct?

Eugene.Zelenko edited subscribers, added: xazax.hun, LegalizeAdulthood; removed: Restricted Project.
aaron.ballman added inline comments.Apr 23 2018, 11:23 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
70–72

I think you can use isASCII() from CharInfo.h rather than reimplement it.

test/clang-tidy/modernize-raw-string-literal.cpp
44

IIRC, the default behavior is that if no matching CHECK-FIXES line is found, then it is considered a failure. Have you tried your test code without your change to verify that this is the case?

zinovy.nis marked an inline comment as done.Apr 23 2018, 11:30 PM
zinovy.nis added inline comments.
clang-tidy/modernize/RawStringLiteralCheck.cpp
70–72

Nice finding! Thanks!

test/clang-tidy/modernize-raw-string-literal.cpp
44
  1. Without my fix my test fails with a Python decoder error as it cannot print Unicode symbols in Windows console. But no FileCheck errors occur:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 974: ordinal not in range(128)
  1. Regardless of my changes when I replace "char const *const Delete("\\\177");" with "char const *const Delete("\000\\\177");" (leading \0) test still passes! Looks like CHECK-FIXes must be explicit.
zinovy.nis marked an inline comment as done.
  • Use clang::isASCII instead of home-brewed code.
zinovy.nis marked 4 inline comments as done.Apr 23 2018, 11:32 PM
aaron.ballman added inline comments.Apr 24 2018, 4:45 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
71

I am starting to think that this functionality should be refactored because the check is now O(N^2) in the worst case because all of the bytes of the string need to be touched twice. It would be nice for performance reasons to combine this so that there's only a single pass over all of the characters.

What do you think?

zinovy.nis added inline comments.Apr 24 2018, 5:51 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
71

Sorry, but why O(N^2)? isASCII is O(1), it's just return C<=127. find_if_not is O(N).

aaron.ballman added inline comments.Apr 24 2018, 6:02 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
71

The find_if_not() call you add touches every character in the string to see if it's ASCII and it all characters are ASCII, then containsEscape() calls find() which touches every character again (assuming the searched character is never encountered).

Looking a bit deeper, isRawStringLiteral() also calls find(), but it asserts that the character is found, so not every character is touched in the string and it should find a the searched character quite quickly.

zinovy.nis added inline comments.Apr 24 2018, 6:07 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
71

OK, I'll see how to combine theses checks into a single one.

But anyway I see only 2*O(N), not O(N^2) here.

aaron.ballman added inline comments.Apr 24 2018, 6:34 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
71

Oh, derp, that's my thinko -- sorry! You are correct, that's 2 * O(N) and not O(N^2).

zinovy.nis added inline comments.Apr 24 2018, 1:59 PM
clang-tidy/modernize/RawStringLiteralCheck.cpp
71

May be we can use here std::isprint

  • Optimized containsEscapedCharacters not to re-create bitset<char> (implicitly in StringRef::find_first_of) for each literal.
  • Merged 2 passes for testing for allowed chars into a single one.
zinovy.nis edited the summary of this revision. (Show Details)Apr 25 2018, 11:02 PM
zinovy.nis marked 5 inline comments as done.
zinovy.nis marked an inline comment as done.Apr 25 2018, 11:05 PM

Aaron, any comments for the new revision?

Other than a few minor nits, ship it.

clang-tidy/modernize/RawStringLiteralCheck.cpp
110

"Upper ASCII" is a misnomer. ASCII only defines character codes 0-127. Non-ASCII codes are what is being described here.

111

Why does this loop go down instead of up? I would have expected a more conventional

for (unsigned char C = 0x80u; C <= 0xFFu; ++C)

clang-tidy/modernize/RawStringLiteralCheck.h
39

Consider introducing a typedef for this ugly type name.

test/clang-tidy/modernize-raw-string-literal.cpp
44

Consider adding CHECK-FIXES lines for the other test cases that should remain unmolested by the check.

This revision is now accepted and ready to land.May 1 2018, 10:21 AM
zinovy.nis added inline comments.May 1 2018, 10:32 AM
clang-tidy/modernize/RawStringLiteralCheck.cpp
111

The only reason is that ++C for C==0xFFu is 0 so C <= 0xFF is always satisfied leading to inifinite loop.

clang-tidy/modernize/RawStringLiteralCheck.cpp
111

OK, makes sense. I rarely write loops that iterate over character types.

zinovy.nis marked an inline comment as done.May 1 2018, 11:35 AM
zinovy.nis added inline comments.
test/clang-tidy/modernize-raw-string-literal.cpp
44

I'm going to make a separate commit for that, OK?

zinovy.nis marked an inline comment as done.May 1 2018, 11:38 AM
zinovy.nis closed this revision.May 1 2018, 1:25 PM

Fixed in git-svn-id: https://llvm.org/svn/llvm-project/clang-tools-extra/trunk@331297 91177308-0d34-0410-b5e6-96231b3b80d8

alexfh added inline comments.May 2 2018, 3:41 AM
test/clang-tidy/modernize-raw-string-literal.cpp
42

These test cases without corresponding CHECK-FIXES/CHECK-MESSAGES lines will work (that is ensure no fixes are applied to the code) under the assumption that all warnings explicitly covered by CHECK-MESSAGE lines will only contain fixes for relevant code (in most cases in the same line), and there are CHECK-FIXES for all of them.

And if there is no CHECK-MESSAGES line then FileCheck -implicit-check-not='warning|error' -check-prefix=CHECK-MESSAGES (used in the check_clang_tidy.py) will fail if there is an unaccounted (i.e. not covered by a CHECK-MESSAGES: pattern) warning.

The assumption won't work if the check makes distant fixes (for example, if it inserts #include directives, there should be a CHECK-FIXES line for that).

Thus there's not much value in checking these lines are left unmodified. But in case there is a warning which should contain no fix, checking that the code was not modified by clang-tidy is quite useful.

We could make this more explicit in different ways:

  1. Run FileCheck on the result of diff original-file file-processed-by-clang-tidy. That may make tests much more verbose.
  2. Run FileCheck on the result of clang-tidy -export-fixes. That would make the tests both more verbose and harder to verify for sanity (applying textual replacements with given byte offsets is not what a human brain does well).

Better ideas are welcome.