This is an archive of the discontinued LLVM Phabricator instance.

[YAMLParser] Add multi-line literal folding support
ClosedPublic

Authored by HassanElDesouky on May 16 2021, 7:57 PM.

Details

Summary

Last year I was working at Swift to add support for Localization of Compiler Diagnostic Messages. We are currently using YAML as the new diagnostic format. The LLVM::YAMLParser didn't have a support for multiline string literal folding and it's crucial to have that for the diagnostic message to help us keep up with the 80 columns rule. Therefore, I decided to add a multiline string literal folding support to the YAML parser.

Diff Detail

Event Timeline

HassanElDesouky requested review of this revision.May 16 2021, 7:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2021, 7:57 PM

Fixed clang-tidy warning

Not too deep into the YAML world, so just my two cents re the isLineBreak function.

llvm/lib/Support/YAMLParser.cpp
1049

shouldn't this rather be a check whether the line is _empty_ -> whether it contains anything else than \n\r\t ?
Currently, this does not support empty lines with spaces, multiple tabs or windows style line feeds \r\n.

This would be easily solvable by using something like:

for(auto Position = Line.begin(); Position != Line.end(); ++Position)
  if(!isBlankOrBreak(Position))
     return false;
return true;

Renaming the function to isLineEmpty might sense then, too.

HassanElDesouky marked an inline comment as done.Jun 2 2021, 11:00 PM

Not too deep into the YAML world, so just my two cents re the isLineBreak function.

Thank you for the review! I changed the function as suggested.

@silvas sorry for pinging you. but this has been in review since May 17th. Can you take a look at it when you have free time?

@HassanElDesouky sorry, I'm not that familiar with this code.

@Bigcheese can you PTAL?

@dexonsmith Sorry for the ping, but this has been in review for so long. Can you review when you have the time?

@silvas Do you know anyone who can review this? I really wanna this to get merged :(

I'm not sure I'm the best person for the job, but I tried to give my thoughts.

The only correctness issue I'm not certain about is the discussion around what constitutes an "empty" line. My understanding is that lines containing only whitespace can be non-empty, and that only the indent is stripped before checking if there is content in the line. @fodinabor do you have any links that describe your definition of empty? The docs I'm looking at are https://yaml.org/spec/1.2/spec.html#l-empty(n,c)

Other than that I believe your code as-is faithfully implements the YAML spec, but the approach you took doesn't match how I model it in my head, which makes it harder to review. I suggested a different approach that I'm more confident is correct (or at the very least, if it is incorrect it might help me identify the part of the spec I don't understand!)

Let me know what you think!

llvm/lib/Support/YAMLParser.cpp
1535

This appears to be dead, was it used previously?

1684

I'd rather hoist the call to isFoldedBlock up instead of juggling Start like this.

I think even better would be to replace isFoldedBlock with something like:

/// Scan a block scalar style indicator and header.
///
/// Note: This is distinct from scanBlockScalarHeader to mirror the fact that
/// YAML does not consider the style indicator to be a part of the header.
///
/// Return false if an error occurred.
bool scanBlockScalarIndicators(char &StyleIndicator, char &ChompingIndicator, unsigned &IndentIndicator, bool &IsDone);

Then the code here is:

bool Scanner::scanBlockScalar(bool IsLiteral) {
  char StyleIndicator;
  char ChompingIndicator;
  unsigned BlockIndent;
  bool IsDone;
  if (!scanBlockScalarIndicators(StyleIndicator, ChompingIndicator, BlockIndent, IsDone))
    return false;
  ...
  // Here you could do `bool IsFolded = StyleIndicator == '>';`

It would be even better to have this return actual enums instead of chars or bools:

enum BlockStyle { Literal; Folded; };
enum BlockChomping { ... };
bool scanBlockScalarIndicators(BlockStyle &Style, BlockChomping &Chomping, unsigned &Indent, bool &IsDone);

// Now code can check `if (Style == BlockStyle::Folded)`, which I think reads the best out of the options, with the bonus of type safety to catch typos.

I don't know if this case is enough reason to change the style of surrounding code to use enums though, so I'd be fine with the new scan method just accepting a bool& or char& for the style indicator and calling it a day. Do others have any thoughts?

1717–1742

I think this can be simplified if you just think of the folding as a special case at the point this loop encounters a line where LineStart != Current (i.e. a line with content). At that point, for literal blocks we just insert LineBreaks count of newlines as we did previously, whereas for folded blocks we do one of three things:

  • If LineBreaks == 0 we don't append anything.
  • If LineBreaks == 1 we append a single space (this is the "folding").
  • Otherwise, we append LineBreaks - 1 newlines (this is the clause of YAML which allows you to get newlines into the final scalar string by using multiple adjacent newlines; the general case is that the first two newlines together get you into this "no folding" mode, then you can chain as many newlines as you want before encountering content again and going back to "folding" mode).

This avoids maintaining more state in checking the last character in Str, and it shows the common aspects of the two styles more clearly.

I'm not a YAML expert, though, so if I'm just wrong please correct me! This is just based on my understanding after skimming the relevant parts of the YAML spec.

So, I tried to work out the behavior of the spec parser, and I think my understanding of empty is faulty. Compare http://ben-kiki.org/ypaste/data/26953/index.html (all content is non-space characters) with http://ben-kiki.org/ypaste/data/26954/index.html (the middle line content is a single space).

The single space is still considered content (what I would expect), but somehow prevents the folding from occuring (not what I would expect). I think this lines up with @fodinabor definition of isLineEmpty? In that case, I think my suggestion would be:

if (LineStart != Current) {
  if (LineBreaks && IsFolded) {
    // The folded style "folds" any single line break between content into a single space, except when that
    // content is "empty" (only contains whitespace) in which case the line break is left as-is.
    if (LineBreaks == 1) {
      Str.append(LineBreaks, isLineEmpty(StringRef(LineStart, Current-LineStart)) ? '\n' : ' ');
    }
    // If we saw a single line break, we are completely replacing it and so want `LineBreaks == 0`.
    // Otherwise this decrement accounts for the fact that the first line break is "trimmed", only
    // being used to signal a sequence of line breaks which should not be folded.
    LineBreaks--;
  }
  Str.append(LineBreaks, '\n');
  Str.append(StringRef(LineStart, Current - LineStart));
  LineBreaks = 0;
}

@scott.linder Thank you so much for the review!

  1. I made the changes you pointed out regarding defining a new scanBlockScalarIndicators method.
  2. I think the idea of enum BlockStyle and enum BlockChomping is really good. I will definitely implement it but maybe in another PR/diff. -- if that's okay :)
  3. I'm afraid that your suggested code in regarding simplifying the logic behind folding is not correct. It doesn't work with folded scalars that has strip/clip. However, I tried my best to simplify the logic. If you have any other ideas let me know!

@scott.linder Thank you so much for the review!

  1. I made the changes you pointed out regarding defining a new scanBlockScalarIndicators method.
  2. I think the idea of enum BlockStyle and enum BlockChomping is really good. I will definitely implement it but maybe in another PR/diff. -- if that's okay :)

Yes, that sounds great :)

  1. I'm afraid that your suggested code in regarding simplifying the logic behind folding is not correct. It doesn't work with folded scalars that has strip/clip. However, I tried my best to simplify the logic. If you have any other ideas let me know!

I couldn't figure out exactly what was going wrong with my changes, and when I actually build my changes locally it seems to implement the strip/clip/keep as I'd expect. The existing tests don't cover all the combinations, and the contents changing between them makes it harder to compare, so I also updated the test. See the attached patch (which includes your changes, with my proposed edits), which you should be able to apply with git am

.
Can you let me know what I'm missing?

Apologies for this getting sidelined for so long.
Can the build failures be addressed please, it may just be a case of rebasing to trunk as I think the file causing the failures no longer exists on trunk.

HassanElDesouky marked an inline comment as done.Feb 21 2022, 12:34 AM

Apologies for this getting sidelined for so long.
Can the build failures be addressed please, it may just be a case of rebasing to trunk as I think the file causing the failures no longer exists on trunk.

Yeah sure, I'll do it this weekend.

I finally finished working on this. The tests have passed on Windows and the failure on Debian isn't related to my changes. Can we get this merged, please? 🚢
@scott.linder @njames93

HassanElDesouky marked 2 inline comments as done.Feb 25 2022, 7:28 PM
scott.linder accepted this revision.Feb 28 2022, 11:54 AM

LGTM, thank you!

This revision is now accepted and ready to land.Feb 28 2022, 11:54 AM

LGTM, thank you!

Thank you so much for your help last year! This is my very first commit in LLVM. I’m super happy right now 😄

@scott.linder I don’t have a commit permission so I’d love it if you can commit it on my behalf. Thanks!

@scott.linder I don’t have a commit permission so I’d love it if you can commit it on my behalf. Thanks!

Sure thing, I'll run a final check-all and commit it soon! Welcome to LLVM :)

This revision was landed with ongoing or failed builds.Feb 28 2022, 1:04 PM
This revision was automatically updated to reflect the committed changes.