This is an archive of the discontinued LLVM Phabricator instance.

Stringizing raw string literals containing newline
ClosedPublic

Authored by twoh on Oct 24 2017, 11:58 PM.

Details

Summary

This patch implements 4.3 of http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4220.pdf. If a raw string contains a newline character, replace each newline character with the \n escape code. Without this patch, included test case (macro_raw_string.cpp) results compilation failure.

Event Timeline

twoh created this revision.Oct 24 2017, 11:58 PM
twoh updated this revision to Diff 120195.Oct 25 2017, 12:01 AM

clang-format

twoh added a comment.Nov 1 2017, 1:28 PM

Ping. Thanks!

On the first glance string erase and insert seem to be done correctly. But we can still improve the tests. Can you please add tests for tricky cases where new line or backslash appear in unexpected places? Some of the examples are

R"ab\
cd"

R"ab

c"

R"ab
\"

R"ab
\\"

I didn't check if all of them are valid but I hope you got the idea.

lib/Lex/Lexer.cpp
224

getEscapedNewLineSize mentions

P[-1] is known to be a "\" or a trigraph equivalent on entry to this function.

Is this precondition correct in this case?

And std::string::substr creates a copy of a substring. It is inefficient in the loop and looks like you don't really need std::string here anyway.

twoh updated this revision to Diff 123658.Nov 20 2017, 2:34 PM

Addressing @vsapsai's comments. Thank you for the suggestion! Added test case actually finds an off-by-one error in the original patch. I improved the comments as well.

jkorous-apple added inline comments.
lib/Lex/Lexer.cpp
218

Wouldn't auto or typename T::size_type instead of unsigned be more appropriate here?

Both of your supported use cases have this member type.
http://llvm.org/doxygen/classllvm_1_1StringRef.html#a54e59e2d53e5ee736ee060be7c457508
http://llvm.org/doxygen/classllvm_1_1SmallVectorImpl.html#acc72e8846802a1e703501219cf19458e

232

I am just wondering if potential performance benefit of counting all the extra space in advance and resizing the string just once might be interesting here.
Basically with current approach characters at the end of the string are moved as many times as there are endlines in the string.

twoh updated this revision to Diff 124515.Nov 27 2017, 10:08 PM

Thanks @jkorous-apple for your comments. I modified the type for the variables and replaced unnecessary inserts and erases with updates.

I am sorry I wasn't really clear. What I meant was to do something like this (pseudo-code, dealing only with newlines):

if( Str.size() == 0)
  return;

// Calculate all the extra space needed first.

typename T::size_type extra_space = 0;
bool previous_char_was_endline = false;

for(const auto ch : Str) {
  if( ch == '\n' || ch == '\r' ) {
    if( !previous_char_was_endline )
      ++extra_space;
    previous_char_was_endline = true;
  } else {
    previous_char_was_endline = false;
  }
}

if (extra_space == 0)
  return; 

// Resize the string.

const typename T::size_type original_size = Str.size();
Str.resize(original_size + extra_space);

// Iterate backwards and move characters as needed.

bool is_in_block_of_endlines = false;

for(typename T::size_type i = original_size - 1; i > 0; --i) {
  if( Str[i] == '\n' || Str[i] == '\r' ) {
    if (!is_in_block_of_endlines) {
      Str[i + extra_space - 1] = '\\';
      Str[i + extra_space] = 'n';
      --extra_space;
      if(extra_space == 0)
        return; // early exit - no more characters need to be moved
    }
    is_in_block_of_endlines = true;
  } else {
    Str[i + extra_space] = Str[i];
    is_in_block_of_endlines = false;
  }
}

This is just a suggestion, it is a bit more complicated but should be O(lenght_of_string) whereas your solution is a bit more straightforward but is more like O(length_of_string * number_of_endlines_in_string). I leave it up to you what is better here. If you decide to go this way, please assume my pseudocode is buggy and don't rely on it.

twoh added a comment.Nov 29 2017, 10:29 AM

@jkorous-apple Thanks for the comments! Yeah, I was thinking of O(lenght_of_string) approach, but considering the complicatedness of the implementation (I guess the real implementation would be a bit more complex than your pseudo implementation to handle quote and '\n\r' '\r\n' cases) I decided to stay with O(length_of_string * number_of_endlines_in_string) but optimizing the number of move operations.

@vsapsai @jkorous-apple, I wonder if you can actually approve the patch or suggest other reviewers? Thanks!

Thank you for your patience @twoh and sorry for the delay.
I have few suggestions about doxygen annotations but otherwise LGTM.

include/clang/Lex/Lexer.h
247 ↗(On Diff #124515)

Shouldn't we put all the details from implementation annotation here as well (since this is the public interface that people will actually use)?

251 ↗(On Diff #124515)

Shouldn't we put all the details from implementation annotation here as well (since this is the public interface that people will actually use)?

lib/Lex/Lexer.cpp
215

I am not sure I understand this correctly but wouldn't it be more precise if these literals are escaped?
... escaping '\' ... -> ...escaping '\\' ...
... with "\n" ... -> ... with "\\n"

Alternatively we could use R"(\)" and R"(\n)".

twoh updated this revision to Diff 125368.Dec 4 2017, 10:35 AM

Thanks @jkorous-apple for the comment. I think your suggestion is a more
precise description for the implementation, and adjusted the comments
accordingly.

I intentionally didn't add the details to the comments for the header,
as LLVM coding standard "Don’t duplicate the documentation comment in
the header file and in the implementation file. ... implementation files
can include additional comments (not necessarily in Doxygen markup) to
explain implementation details as needed." I'd like to leave it as of
now if you're not strongly against it.

Thank you again!

I see your concern. Would you just consider moving the annotation to header file then? It looks more like documentation of public interface than implementation details to me. I imagine any programmer interested in these methods needs to know these details so they would probably appreciate not having to search for it.

Anyway, I don't want to dwell on minor details and it's probably a matter of opinion as well. If you prefer to leave it as it is feel free to commit your patch.

Good job!

LGTM.

jkorous-apple accepted this revision.Dec 5 2017, 4:33 AM
This revision is now accepted and ready to land.Dec 5 2017, 4:33 AM
twoh updated this revision to Diff 125563.Dec 5 2017, 9:48 AM

@jkorous-apple Got it. I agree that it would be better to move the comments to the header. Will land it soon. Thanks!

twoh updated this revision to Diff 125564.Dec 5 2017, 10:06 AM

clang-format

This revision was automatically updated to reflect the committed changes.