This is an archive of the discontinued LLVM Phabricator instance.

[Sema] Handle UTF-8 invalid format string specifiers
ClosedPublic

Authored by bruno on Mar 19 2016, 12:34 PM.

Details

Summary

Improve invalid format string specifier handling by printing out invalid specifiers characters with \x, \u and \U. Previously clang would print gargabe whenever the character is unprintable.

Example, before:

NSLog(@"%\u25B9", 3); => warning: invalid conversion specifier ' [-Wformat-invalid-specifier]

after:

NSLog(@"%\u25B9", 3); => warning: invalid conversion specifier '\u25b9' [-Wformat-invalid-specifier]

Diff Detail

Event Timeline

bruno updated this revision to Diff 51116.Mar 19 2016, 12:34 PM
bruno retitled this revision from to [Sema] Handle UTF-8 invalid format string specifiers.
bruno updated this object.
bruno added a reviewer: rsmith.
bruno added subscribers: cfe-commits, dexonsmith.
rsmith edited edge metadata.Mar 28 2016, 11:22 AM

This patch builds a length-1 ConversionSpecifier but includes the complete code point in the length of the overall format specifier, which is inconsistent. Please either treat the trailing bytes as part of the ConversionSpecifier or revert the changes to ParsePrintfSpecifier and handle this entirely within HandleInvalidConversionSpecifier.

Does the same problem exist when parsing scanf specifiers?

lib/Analysis/PrintfFormatString.cpp
322

in -> is

324

The interpretation of a format string by printf should not depend on the locale, so our parsing of a format string should not either.

bruno added a comment.Mar 28 2016, 1:34 PM

This patch builds a length-1 ConversionSpecifier but includes the complete code point in the length of the overall format specifier, which is inconsistent. Please either treat the trailing bytes as part of the ConversionSpecifier or revert the changes to ParsePrintfSpecifier and handle this entirely within HandleInvalidConversionSpecifier.

Ok, gonna handle this entirely within HandleInvalidConversionSpecifier then.

Does the same problem exist when parsing scanf specifiers?

Yes, I missed that, will update accordingly.

lib/Analysis/PrintfFormatString.cpp
324

llvm::sys::locale::isPrint does not actually do any locale specific check (maybe it should be moved elsewhere for better consitency?):

bool isPrint(int UCS) {
#if LLVM_ON_WIN32
  // Restrict characters that we'll try to print to the lower part of ASCII
  // except for the control characters (0x20 - 0x7E). In general one can not
  // reliably output code points U+0080 and higher using narrow character C/C++
  // output functions in Windows, because the meaning of the upper 128 codes is
  // determined by the active code page in the console.
  return ' ' <= UCS && UCS <= '~';
#else
  return llvm::sys::unicode::isPrintable(UCS);
#endif
}

This logic is needed anyway though. Suggestions?

bruno updated this revision to Diff 51859.Mar 28 2016, 5:22 PM
bruno edited edge metadata.

Update after Richard's review.

  • Handle scanf
  • Properly update ConversionSpecifier
rsmith accepted this revision.Mar 28 2016, 6:06 PM
rsmith edited edge metadata.
rsmith added inline comments.
lib/Analysis/FormatString.cpp
276 ↗(On Diff #51859)

How about using getNumBytesForUTF8(FirstByte) != 1 here?

278 ↗(On Diff #51859)

Perhaps only check &SB, &SB + Len -- it doesn't seem problematic if there's some non-UTF8 data after the specifier.

This revision is now accepted and ready to land.Mar 28 2016, 6:06 PM
bruno closed this revision.Mar 29 2016, 10:41 AM

Thanks Richard. Applied your last comments and committed in r264752