This is an archive of the discontinued LLVM Phabricator instance.

[clang-diff] Fix getStmtValue when dealing with wide, UTF16 UTF32 chars
ClosedPublic

Authored by PRESIDENT810 on May 30 2022, 6:47 AM.

Details

Summary

This fixes https://github.com/llvm/llvm-project/issues/55771.
Directly using StringLiteral::getString for wide string is not currently supported; therefore in ASTDiff, getStmtValue will fail when asserting that the StringLiteral has a width of 1. This patch also covers cases for UTF16 and UTF32 encoding, along with corresponding test cases.

Diff Detail

Event Timeline

PRESIDENT810 created this revision.May 30 2022, 6:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 30 2022, 6:48 AM
PRESIDENT810 requested review of this revision.May 30 2022, 6:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 30 2022, 6:48 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
johannes requested changes to this revision.May 31 2022, 2:48 PM

I wonder if clang-diff is useful in its current state, I remember there are many edge cases left over.

Anyway, I've left some minor comments. I'm not sure how to download a patch that Git can apply so I didn't try it yet.

clang/lib/Tooling/ASTDiff/ASTDiff.cpp
468

I think there are some other cases that need fixing too.

  1. isUTF16() (like u'x')
  2. isUTF32() (like U'x')

Can you fix them too? Should be quite similar.

471

How about using "auto" here to avoid repeating the type?

474

Instead of calling substr() it seems better to move this to the constructor

std::wstring wstr(temp, wsize);

(I think the reason why we need wsize is because strings can have embedded null characters)

This revision now requires changes to proceed.May 31 2022, 2:48 PM

Refactored some code and add support of U16 & U32 characters, as well as tests for them.

johannes requested changes to this revision.Jun 6 2022, 11:09 AM

I've suggested some more refactorings but otherwise this should be good to go

clang/lib/Tooling/ASTDiff/ASTDiff.cpp
497

Sorry for the delay. Looks pretty good already.

think there's a bit of redundancy now, I tried to simplify it, WDYT?
I used uppercase variable names as this is the local convention. I'm not sure what's the global one.
String->getByteLength() / String->getCharByteWidth(); adds an unnecessary division, we can just use getLength().

if (String->isWide() || String->isUTF16() || String->isUTF32()) {
  std::string UTF8Str;
  unsigned int NumChars = String->getLength();
  const char *Bytes = String->getBytes().data();
  if (String->isWide()) {
    const auto *Chars = reinterpret_cast<const wchar_t *>(Bytes);
    if (!convertWideToUTF8({Chars, NumChars}, UTF8Str))
      return "";
  } else if (String->isUTF16()) {
    const auto *Chars = reinterpret_cast<const unsigned short *>(Bytes);
    if (!convertUTF16ToUTF8String({Chars, NumChars}, UTF8Str))
      return "";
  } else {
    assert(String->isUTF32() && "Unsupported string encoding.");
    const auto *Chars = reinterpret_cast<const unsigned int *>(Bytes);
    if (!convertUTF32ToUTF8String({Chars, NumChars}, UTF8Str))
      return "";
  }
  return UTF8Str;
}
clang/test/Tooling/clang-diff-ast.cpp
80

I think two lines per encoding should be enough, so how about doing this right next to the existing string literal?

const char *foo(int i) {
  if (i == 0)
    // CHECK: StringLiteral: foo(
    return "foo";
  // CHECK: StringLiteral: wide(
  (void)L"wide";
  // CHECK: StringLiteral: utf-16(
  (void)u"utf-16";
  // CHECK: StringLiteral: utf-32(
  (void)U"utf-32";
  // CHECK-NOT: ImplicitCastExpr
  return 0;
}
This revision now requires changes to proceed.Jun 6 2022, 11:09 AM

More refactoring following Johannes's suggestion

PRESIDENT810 retitled this revision from [clang-diff] Fix getStmtValue when dealing with wide chars to [clang-diff] Fix getStmtValue when dealing with wide, UTF16 UTF32 chars.
PRESIDENT810 edited the summary of this revision. (Show Details)
johannes accepted this revision.Jun 7 2022, 12:45 AM
johannes added inline comments.
clang/lib/Tooling/ASTDiff/ASTDiff.cpp
477

Nit: I wonder if we should use UTF16 instead of unsigned int. Is there a guiding principle?

478

Nit: is there a reason for explicitly writing the ArrayRef type? Maybe it's an LLVM convention?

This revision is now accepted and ready to land.Jun 7 2022, 12:45 AM

Sorry! I'm a novice at LLVM and I just didn't realize that those types can be implicitly cast to ArrayRef ... I have changed those and it should be fine now!

dyung added a subscriber: dyung.Jun 7 2022, 12:16 PM

Hi, the test you modified in your change seems to be failing on the PS4 linux build bot at https://lab.llvm.org/buildbot/#/builders/139/builds/22964.

******************** TEST 'Clang :: Tooling/clang-diff-ast.cpp' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/clang-diff -ast-dump /home/buildbot/buildbot-root/llvm-project/clang/test/Tooling/clang-diff-ast.cpp -- -std=c++11 | /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-project/clang/test/Tooling/clang-diff-ast.cpp
--
Exit Code: 1
Command Output (stderr):
--
/home/buildbot/buildbot-root/llvm-project/clang/test/Tooling/clang-diff-ast.cpp:50:12: error: CHECK: expected string not found in input
 // CHECK: StringLiteral: wide(
           ^
<stdin>:35:21: note: scanning from here
 StringLiteral: foo(34)
                    ^
<stdin>:39:2: note: possible intended match here
 StringLiteral: utf-16(38)
 ^
Input file: <stdin>
Check file: /home/buildbot/buildbot-root/llvm-project/clang/test/Tooling/clang-diff-ast.cpp
-dump-input=help explains the following input dump.
Input was:
<<<<<<
            .
            .
            .
           30:  IfStmt(29) 
           31:  BinaryOperator: ==(30) 
           32:  DeclRefExpr: i(31) 
           33:  IntegerLiteral: 0(32) 
           34:  ReturnStmt(33) 
           35:  StringLiteral: foo(34) 
check:50'0                         X~~~ error: no match found
           36:  CStyleCastExpr(35) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~
           37:  StringLiteral(36) 
check:50'0     ~~~~~~~~~~~~~~~~~~~
           38:  CStyleCastExpr(37) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~
           39:  StringLiteral: utf-16(38) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:50'1      ?                          possible intended match
           40:  CStyleCastExpr(39) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~
           41:  StringLiteral: utf-32(40) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~
           42:  ReturnStmt(41) 
check:50'0     ~~~~~~~~~~~~~~~~
           43:  IntegerLiteral: 0(42) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~~~~
           44:  AccessSpecDecl: public(43) 
check:50'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            .
            .
            .
>>>>>>
--

Can you take a look and revert if you need time to investigate?

Ok thanks, I will take a look later. Might be a locale issue

No idea why the conversion fails for the wide string.
First step is to reproduce the problem. I guess we should try in a Ubuntu VM.
Is there an easy way to see if other builders succeeded?

No idea why the conversion fails for the wide string.
First step is to reproduce the problem. I guess we should try in a Ubuntu VM.
Is there an easy way to see if other builders succeeded?

Unfortunately I think you would have to search through the history for it.

I can tell you that our PS4 Windows buildbot did succeed though. https://lab.llvm.org/buildbot/#/builders/216/builds/5492