This is an archive of the discontinued LLVM Phabricator instance.

clang-format: [JS] Handle string literals spanning character classes.
ClosedPublic

Authored by mprobst on Oct 14 2015, 10:42 PM.

Details

Reviewers
djasper
Summary

If a RegExp contains a character group with a quote (/["]/), the
trailing end of it is first tokenized as a string literal, which leads
to the merging code seeing an unbalanced bracket.

This change parses regex literals from the left hand side. That
simplifies the parsing code and also allows correctly handling escapes
and character classes, hopefully correctly parsing all regex literals.

Diff Detail

Event Timeline

mprobst updated this revision to Diff 37446.Oct 14 2015, 10:42 PM
mprobst retitled this revision from to clang-format: [JS] Handle string literals spanning character classes..
mprobst updated this object.
mprobst added a reviewer: djasper.
mprobst added subscribers: klimek, cfe-commits.
mprobst updated this revision to Diff 37447.Oct 14 2015, 11:35 PM

Simplified escape handling.

djasper added inline comments.Oct 16 2015, 3:30 PM
lib/Format/Format.cpp
757

I wonder if escape sequences are also relevant to template strings. If yes, we are probably missing a test ;-).

861

Maybe add a test to see how this goes wrong and specifically also ensure it doesn't crash.

863

Remove the ().

879

I'd probably put this into the other two conditions to reduce indentation. More specifically, I'd do:

if (PrecedesOperand(Prev)) // sink the nullptr check into the function.
  return;

if (Prev->isOneOf(...)) {
}
910

No braces.

mprobst updated this revision to Diff 37654.Oct 16 2015, 4:29 PM
mprobst marked 3 inline comments as done.
  • add test for if (...) /re/;
  • address minor review comments
  • fix unary handling.
lib/Format/Format.cpp
879

Not having a Prev is actually part of the true case (no token/start of file precedes an operand), i.e. if nothing precedes the slash, it must be a regexp, too. Given that, both of the nested checks only run if (Prev), so I think the nesting makes sense.

However it turns out my unary check was slightly off and untested (whoops). Added a test and fixed the code.

mprobst added inline comments.Oct 16 2015, 4:46 PM
lib/Format/Format.cpp
757

You cannot actually escape in a template string - backslashes in them are literal backslashes. I.e. foo\bar` does not work.

However we do have a test for that that is presumably wrong (but still passes with this change):

EXPECT_EQ("var x = ` \\` a`;\n"
          "var y;",
          format("var x = ` \\` a`;\n"
                 "var y;"));

We can fix that in a later change, I think it's unrelated to this change.

djasper added inline comments.Oct 17 2015, 5:02 AM
lib/Format/Format.cpp
859

lowerCamelCase for function names in LLVM.

879

I understand and I actually find that hard to understand. I think a better solution might be to pull out a function for this logic. So that you just call:

if (!canPrecedeRegexLiteral(Prev))
  return;

and then

bool canPrecedeRegeLiteral(const FormatToken *Prev) {
  if (!Prev)
    return true;
  ..
}
unittests/Format/FormatTestJS.cpp
609

This case we could actually fix, right? Because we know that the ")" belongs to the if and if the following token isn't a comment or an open brace, it starts a new statement. Probably shouldn't be done in the same patch, but maybe add a FIXME.

mprobst updated this revision to Diff 37695.Oct 17 2015, 10:10 PM
mprobst marked 3 inline comments as done.
  • add test
  • review comments, fix unary handling.
  • Extract canPrecedeRegexLiteral function.
lib/Format/Format.cpp
879

I see, that's indeed cleaner. Done.

unittests/Format/FormatTestJS.cpp
609

Not sure how we'd do that in the general case.

Regular expression recognition needs to happen in the lexing phase (otherwise it'd greatly complicate everything later on). And in the lexing phase, it's rather complicated to track if the code is in an if statement, as of course the if condition can contain arbitrary code.

Normally the fix for this is having the parser call down into the lexer with the contextual information that at this location a slash introduces a regex token, but we cannot do that, as we are neither parsing here, nor do we have precise grammatical structure around.

djasper accepted this revision.Oct 18 2015, 12:01 AM
djasper edited edge metadata.

Looks good.

This revision is now accepted and ready to land.Oct 18 2015, 12:01 AM
djasper closed this revision.Oct 18 2015, 12:04 AM

Submitted as r250648.