Page MenuHomePhabricator

[YAML] Add support for non-printable characters

Authored by thegameg on Dec 15 2017, 7:05 AM.



LLVM IR function names which disable mangling start with '\01' (

When an identifier like "\01@abc@" gets dumped to MIR, it is quoted, but only with single quotes.

"The allowed character range explicitly excludes the C0 control block allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF."

"All non-printable characters must be escaped. [...] Note that escape sequences are only interpreted in double-quoted scalars."

This patch adds support for printing non-printable characters between double quotes if needed.

Diff Detail


Event Timeline

thegameg created this revision.Dec 15 2017, 7:05 AM

Also related to PR31743.


469 ↗(On Diff #127122)


thegameg updated this revision to Diff 127134.Dec 15 2017, 7:58 AM
thegameg marked an inline comment as done.

Fix typo in docs.

MatzeB accepted this revision.Dec 15 2017, 1:51 PM

LGTM too with some suggestions/nitpicks:

503–565 ↗(On Diff #127134)

Can't you combine the two functions to a single one so we only need to scan the string once? (Just keep the checks at the begin and end of needsSingleQuote() around and then perform the scan from needsDoubleQuotes() while additionally checking the things from ScalarSafeChars. Instead of returning false/true you would compute the max required quoting)

644–647 ↗(On Diff #127134)

While I guess it's not strictly necessary, how about something along the lines of:

// Note: Special casing the non-printable ASCII characters.
if (S[j] == QuoteChar || (MustQuote == QuotingType::Double && S[j] <= 0x1F)) {
  if (S[j] <= 0x1f) {
    output("\\0x" + hex(S[j]));
  } else {
  i = j+1;

so we get a nice \0x01 output instead of a quoted unprintable character?

This revision is now accepted and ready to land.Dec 15 2017, 1:51 PM
This revision was automatically updated to reflect the committed changes.
thegameg marked 2 inline comments as done.
MatzeB added inline comments.Dec 18 2017, 5:34 PM

This part is sketchy. As we are most probably dealing with UTF-8 without having fully parsed code point here. (Not that the code/comments give any indication that we indeed have UTF-8). While for ASCII (= top bit is clear) UTF-8 and ASCII are the same, for other characters (top bit set) it is not.

I think the best/easiest thing to do here is to always trigger double quoting on (C & 0x80) != 0

thegameg marked an inline comment as done.Dec 19 2017, 3:54 AM
thegameg added inline comments.

Ah, I quickly skipped that part...

Should be fixed in r321068.

Hello @thegameg !
Is is correct that printable UTF-8 characters are also escaped by these changes?
For example, Clang-Tidy output now uses double-quotes for UTF-8:

Message:         "parameter '\xffffffffffffffd0\xffffffffffffffbf\xffffffffffffffd0\xffffffffffffffb0\xffffffffffffffd1\xffffffffffffff80\xffffffffffffffd0\xffffffffffffffb0\xffffffffffffffd0\xffffffffffffffbc\xffffffffffffffd0\xffffffffffffffb5\xffffffffffffffd1\xffffffffffffff82\xffffffffffffffd1\xffffffffffffff80' is unused"

but previously it was simply:

Message:         'parameter ''параметр'' is unused'

I'm using snakeyaml parser which is cannot recognize these sequences. By this, i've got incorrect clang-tidy output.
Also i saw that not all the UTF-8 characters are printed in double quotes.
For example in the same output,

ReplacementText: ' /*параметр*/'

printed in the single quotes.

thegameg marked an inline comment as done.Dec 21 2017, 9:16 AM

Hello @thegameg !
Is is correct that printable UTF-8 characters are also escaped by these changes?

Sorry for breaking that, hope it's fixed in r321283.