This patch implements 4.3 of http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4220.pdf. If a raw string contains a newline character, replace each newline character with the \n escape code. Without this patch, included test case (macro_raw_string.cpp) results compilation failure.
Details
Diff Detail
- Repository
- rC Clang
Event Timeline
On the first glance string erase and insert seem to be done correctly. But we can still improve the tests. Can you please add tests for tricky cases where new line or backslash appear in unexpected places? Some of the examples are
R"ab\ cd" R"ab c" R"ab \" R"ab \\"
I didn't check if all of them are valid but I hope you got the idea.
| lib/Lex/Lexer.cpp | ||
|---|---|---|
| 242 | getEscapedNewLineSize mentions 
 Is this precondition correct in this case? And std::string::substr creates a copy of a substring. It is inefficient in the loop and looks like you don't really need std::string here anyway. | |
Addressing @vsapsai's comments. Thank you for the suggestion! Added test case actually finds an off-by-one error in the original patch. I improved the comments as well.
| lib/Lex/Lexer.cpp | ||
|---|---|---|
| 217 | Wouldn't auto or typename T::size_type instead of unsigned be more appropriate here? Both of your supported use cases have this member type. | |
| 231 | I am just wondering if potential performance benefit of counting all the extra space in advance and resizing the string just once might be interesting here. | |
Thanks @jkorous-apple for your comments. I modified the type for the variables and replaced unnecessary inserts and erases with updates.
I am sorry I wasn't really clear. What I meant was to do something like this (pseudo-code, dealing only with newlines):
if( Str.size() == 0)
  return;
// Calculate all the extra space needed first.
typename T::size_type extra_space = 0;
bool previous_char_was_endline = false;
for(const auto ch : Str) {
  if( ch == '\n' || ch == '\r' ) {
    if( !previous_char_was_endline )
      ++extra_space;
    previous_char_was_endline = true;
  } else {
    previous_char_was_endline = false;
  }
}
if (extra_space == 0)
  return; 
// Resize the string.
const typename T::size_type original_size = Str.size();
Str.resize(original_size + extra_space);
// Iterate backwards and move characters as needed.
bool is_in_block_of_endlines = false;
for(typename T::size_type i = original_size - 1; i > 0; --i) {
  if( Str[i] == '\n' || Str[i] == '\r' ) {
    if (!is_in_block_of_endlines) {
      Str[i + extra_space - 1] = '\\';
      Str[i + extra_space] = 'n';
      --extra_space;
      if(extra_space == 0)
        return; // early exit - no more characters need to be moved
    }
    is_in_block_of_endlines = true;
  } else {
    Str[i + extra_space] = Str[i];
    is_in_block_of_endlines = false;
  }
}This is just a suggestion, it is a bit more complicated but should be O(lenght_of_string) whereas your solution is a bit more straightforward but is more like O(length_of_string * number_of_endlines_in_string). I leave it up to you what is better here. If you decide to go this way, please assume my pseudocode is buggy and don't rely on it.
@jkorous-apple Thanks for the comments! Yeah, I was thinking of O(lenght_of_string) approach, but considering the complicatedness of the implementation (I guess the real implementation would be a bit more complex than your pseudo implementation to handle quote and '\n\r' '\r\n' cases) I decided to stay with O(length_of_string * number_of_endlines_in_string) but optimizing the number of move operations.
@vsapsai @jkorous-apple, I wonder if you can actually approve the patch or suggest other reviewers? Thanks!
Thank you for your patience @twoh and sorry for the delay.
I have few suggestions about doxygen annotations but otherwise LGTM.
| include/clang/Lex/Lexer.h | ||
|---|---|---|
| 250 | Shouldn't we put all the details from implementation annotation here as well (since this is the public interface that people will actually use)? | |
| 251 | Shouldn't we put all the details from implementation annotation here as well (since this is the public interface that people will actually use)? | |
| lib/Lex/Lexer.cpp | ||
| 214 | I am not sure I understand this correctly but wouldn't it be more precise if these literals are escaped? Alternatively we could use R"(\)" and R"(\n)". | |
Thanks @jkorous-apple for the comment. I think your suggestion is a more
precise description for the implementation, and adjusted the comments
accordingly.
I intentionally didn't add the details to the comments for the header,
as LLVM coding standard "Don’t duplicate the documentation comment in
the header file and in the implementation file. ... implementation files
can include additional comments (not necessarily in Doxygen markup) to
explain implementation details as needed." I'd like to leave it as of
now if you're not strongly against it.
Thank you again!
I see your concern. Would you just consider moving the annotation to header file then? It looks more like documentation of public interface than implementation details to me. I imagine any programmer interested in these methods needs to know these details so they would probably appreciate not having to search for it.
Anyway, I don't want to dwell on minor details and it's probably a matter of opinion as well. If you prefer to leave it as it is feel free to commit your patch.
Good job!
LGTM.
@jkorous-apple Got it. I agree that it would be better to move the comments to the header. Will land it soon. Thanks!
Shouldn't we put all the details from implementation annotation here as well (since this is the public interface that people will actually use)?