[Lexer] Finding beginning of token with escaped new line
ClosedPublic

Authored by idlecode on Mar 8 2017, 10:29 AM.

Details

Summary

Lexer::GetBeginningOfToken produced invalid location when
backtracking across escaped new lines.

This fixes PR26228

Diff Detail

Repository
rL LLVM
idlecode created this revision.Mar 8 2017, 10:29 AM
alexfh requested changes to this revision.Mar 9 2017, 3:30 AM
alexfh added a subscriber: alexfh.
alexfh added inline comments.
lib/Lex/Lexer.cpp
457 ↗(On Diff #91047)

We only care about two specific sequences here: \\\r\n or \\\n, not a backslash followed by arbitrary whitespace.

unittests/Lex/LexerTest.cpp
386 ↗(On Diff #91047)
387 ↗(On Diff #91047)

Variable names should start with a capital letter: TextToLex. Same elsewhere.

402 ↗(On Diff #91047)

Please clang-format.

This revision now requires changes to proceed.Mar 9 2017, 3:30 AM
idlecode marked 3 inline comments as done.Mar 11 2017, 3:36 AM
idlecode added inline comments.
lib/Lex/Lexer.cpp
457 ↗(On Diff #91047)

I just saw that some functions (e.g. line 1285 in this file) accept whitespaces between escape character and new line. How about now?

idlecode updated this revision to Diff 91466.Mar 11 2017, 3:37 AM

Addressed Alexander's comments

alexfh requested changes to this revision.Mar 16 2017, 6:00 AM
alexfh added inline comments.
lib/Lex/Lexer.cpp
457 ↗(On Diff #91047)

Indeed, both clang and gcc accept whitespace between the backslash and the newline character and issue a diagnostic: https://godbolt.org/g/PUCTzF.

This should probably be done similar to Lexer::getEscapedNewLineSize, but in reverse:

assert(isVerticalWhitespace(*P));
--P;
if (P >= BufferStart && isVerticalWhitespace(*P) && *P != P[1]) // Skip the second character of `\r\n` or `\n\r`.
  --P;
// Clang allows horizontal whitespace between backslash and new-line with a warning. Skip it.
while (P >= BufferStart && isHorizontalWhitespace(*P))
  --P;
return P >= BufferStart && *P == '\\';

I'd add a bunch of tests for this function specifically:

<backslash><\r> -> true
<backslash><\n> -> true
<backslash><\r><\n> -> true
<backslash><\n><\r> -> true
<backslash><space><tab><\v><\f><\r> -> true
<backslash><space><tab><\v><\f><\r><\n> -> true
<backslash><\r><\r> -> false
<backslash><\r><\r><\n> -> false
<backslash><\n><\n> -> false
This revision now requires changes to proceed.Mar 16 2017, 6:00 AM
idlecode planned changes to this revision.Mar 20 2017, 1:42 AM
alexfh added a comment.May 9 2017, 5:40 AM

Paweł, are you planning to finish this patch?

Oh, sorry about this - I forgot. I will send patch during this weekend

Oh, sorry about this - I forgot. I will send patch during this weekend

No worries, I just stumbled upon the bug and recalled that there had been a patch to fix it.

idlecode marked an inline comment as done.Jun 6 2017, 11:58 AM
idlecode updated this revision to Diff 102110.Jun 10 2017, 5:41 AM

Added tests for isNewLineEscaped - this fixed some corner cases

Sorry for the delay, I was on vacation.

This looks much better now, thanks! A few more comments though.

lib/Lex/Lexer.cpp
460 ↗(On Diff #102110)

The way the function is exposed to the test may lead to confusion. I'd either properly declare it in the header (and place it in a namespace, if it is not yet) or at least leave a comment here that the function is not static, since it needs to be exposed to the test.

474 ↗(On Diff #102110)

nit: Placing the Str > BufferStart first would make it more obvious that Str can be safely dereferenced.

unittests/Lex/LexerTest.cpp
371–386 ↗(On Diff #102110)

I would better unroll the loop:

auto endsWithEscapedNewline = [] (const char *S) {
  return isNewLineEscaped(S, S + strlen(S) - 1);
};
EXPECT_TRUE(endsWithEscapedNewline("\\\r"));
EXPECT_TRUE(endsWithEscapedNewline("\\\n"));
...
EXPECT_FALSE(endsWithEscapedNewline("\\\r\r"));
...

This would simplify the test and make EXPECT_* macro output sufficient to detect failing patterns without any clarifying messages.

alexfh requested changes to this revision.Jun 26 2017, 8:47 AM
This revision now requires changes to proceed.Jun 26 2017, 8:47 AM
idlecode marked 3 inline comments as done.Aug 4 2017, 8:07 AM
idlecode added inline comments.
lib/Lex/Lexer.cpp
460 ↗(On Diff #102110)

Ok, I have made isNewLineEscaped a static method of Lexer - in Lexer.h there were no global function declaration and I didn't like the idea of introducing one.
I would make it protected/private member but it have to be visible to unit tests (and introducing friend classes just for this method doesn't seem worth it).

idlecode updated this revision to Diff 109740.Aug 4 2017, 8:08 AM
idlecode marked an inline comment as done.
idlecode updated this revision to Diff 109748.Aug 4 2017, 8:35 AM

Applied clang-format

alexfh accepted this revision.Aug 4 2017, 6:30 PM

Looks good with one nit.

Do you need someone to commit the patch for you after you address the comment?

lib/Lex/Lexer.cpp
469–477 ↗(On Diff #109748)

The logic is hard to get here. I'd use a single if and reverse the condition to get rid of the continues:

if (isVerticalWhitespace(*LexStart) && !Lexer::isNewLineEscaped(BufStart, LexStart)) {
  ++LexStart;
  break;
}
This revision is now accepted and ready to land.Aug 4 2017, 6:30 PM

I don't have commit rights yet so I would be grateful for help in this matter :)

lib/Lex/Lexer.cpp
469–477 ↗(On Diff #109748)

Yes, I know - I thought that more vertical code composition would help

idlecode updated this revision to Diff 110029.Aug 7 2017, 10:56 AM

Redability fix in findBeginningOfLine

idlecode marked an inline comment as done.Aug 7 2017, 10:57 AM
This revision was automatically updated to reflect the committed changes.