This is an archive of the discontinued LLVM Phabricator instance.

Add a check whether or not a str is utf8 prior to emplacing
ClosedPublic

Authored by lanza on Oct 8 2018, 5:17 PM.

Details

Summary

Highlighing junk data on VSCode can send a query for evaluate which
fails. In particular cases on Windows, this the error message can end
up as a c-string of [-35,-35,-35,-35,...]. Attempting to emplace this
as the error message causes an assert failure.

Prior to emplacing the error message, confirm that it is valid UTF8 to
eliminate errors such as mentione above.

Diff Detail

Repository
rLLDB LLDB

Event Timeline

lanza created this revision.Oct 8 2018, 5:17 PM
clayborg requested changes to this revision.EditedOct 9 2018, 7:23 AM

So any string in the debugger can contain invalid UTF8. This can easily happen for C string summaries or char array summaries from lldb::SBValue objects when a variable isn't yet initialized. It would rather us sanitize any strings that can contain invalid UTF8 and just change any invalid UTF8 sequences into either "\\xXX" where XX is the byte. Notice the extra backslash then the string would show up as \xXX when it gets displayed. Or we can use \uXXXX (notice just a single backslash) which would send the character over as is, just not as invalid UTF8. Best option would be to convert valid UTF8 sequences over to use \uXXXX format (single backslash to correctly encode the character as unicode), and invalid UTF8 sequences to send each invalid byte as \\xXX (double backslash to just display the non printable character correctly. One vote for using the first \\xXX format is it will display correctly in a variable view. If we encode invalid UTF8 sequences over as the actual characters themselves, then the GUI might show something unexpected, or have issues displaying the invalid UTF8.

I would vote to fix this in LLVM's JSON parser when it is encoding any string value. If we can't do this, then we need to make a function that sanitizes any string values and always use that to encode a string correctly.

This revision now requires changes to proceed.Oct 9 2018, 7:23 AM
lanza added a comment.Oct 12 2018, 1:18 PM

You're certainly right. But the goal I'm working on now is simply to get lldb-vscode running on Windows. I'll take note of this task to fix it into the future.

lanza added a comment.Nov 7 2018, 2:46 PM

Okay, so after looking into this it turns out that llvm::json does handle this well:

Value(std::string V) : Type(T_String) {
  if (LLVM_UNLIKELY(!isUTF8(V))) {
    assert(false && "Invalid UTF-8 in value used as JSON");
    V = fixUTF8(std::move(V));
  }
  create<std::string>(std::move(V));
}

it asserts that the string is valid utf8 but will attempt to fix it if noassert is enabled. So we should just call fixUTF8(thestring) within lldb-vscode at the necessary points where we can end up with non-utf8.

lanza updated this revision to Diff 173038.Nov 7 2018, 2:46 PM

fix if possible

clayborg requested changes to this revision.Nov 8 2018, 9:43 AM

So we need to do this everywhere for any string value that gets put into an object. Otherwise we are just waiting for a crash from any string value. Values of variables, summaries from variables, so many other places where we get the string value from the API in LLDB can fail. I am fine with just making a static helper function that does this for us:

void emplace_safe_string(llvm::json::Object &obj, llvm::StringRef key, llvm::StringRef str) {
  if (llvm::json::isUTF8(str))
    obj.try_emplace(key, str.str());
  else
    obj.try_emplace(key, llvm::json::fixUTF8(str));
}

Then we need to find all places where the string value is coming from an API, and use this function.

This revision now requires changes to proceed.Nov 8 2018, 9:43 AM
lanza updated this revision to Diff 174134.Nov 14 2018, 5:16 PM

add more fixes

clayborg accepted this revision.Nov 14 2018, 9:52 PM

Just reverse if statement as noted in inlined comments and this is good to go.

tools/lldb-vscode/JSONUtils.cpp
28–31

Reverse the if for clarity:

if (LLVM_LIKELY(llvm::json::isUTF8(str)))
    obj.try_emplace(key, str.str());
  else
    obj.try_emplace(key, llvm::json::fixUTF8(str));
}
This revision is now accepted and ready to land.Nov 14 2018, 9:52 PM
This revision was automatically updated to reflect the committed changes.