This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Change line break behaviour for hoverinfo
ClosedPublic

Authored by lolleko on Mar 12 2020, 12:36 PM.

Details

Summary

parseDocumentation retains hard line breaks and removes soft line breaks inside documentation comments.
Wether a line break is hard or soft is determined by the following rules (some of which have been discussed in https://github.com/clangd/clangd/issues/95):

  • Hard Markdown line breaks (https://github.github.com/gfm/#hard-line-breaks) are retained
  • Line breaks that are preceded by a punctuation are retained
  • Line breaks that are followed by "interesting characters" (e.g. Markdown syntax, doxygen commands) are retained
  • All other line breaks are removed

Since this is my first contribution feel free to be extra thorough.

Related issue: https://github.com/clangd/clangd/issues/95

Diff Detail

Event Timeline

lolleko created this revision.Mar 12 2020, 12:36 PM

I understand the main goal here is preserving intended line breaks in documentation comments? Thanks for tackling this problem!

A design goal here is that the markup objects are the source of truth about the document structure. So if we've decided we want a comment to turn into multiple lines, we should create multiple Paragraph objects, this makes it easy to adjust the rendering strategy and isolate differences between text/markdown.

I think this means this splitting needs to happen on the way in instead of on the way out. In Hover.cpp:

if (!Documentation.empty())
  Output.addParagraph().appendText(Documentation);

would be come something like

parseDocumentationComment(Output, Documentation);

which would ultimately add a paragraph to Output for each line detected in Documentation.

Even though markdown doc comments in cpp are still rare, I removed the markdown escaping

Please let's leave this out, at least from this patch, if it's not the primary goal. This is a complicated tradeoff, there's plenty of reasonably common patterns where we should be escaping, such as vector<string>. Moreover if we're treating punctuation in the comments as formatting, we should likely also be doing so in plain-text mode.

FYI D75687 (less unneccesary escaping in markdown) is also in flight and likely to touch some of the same code.

clang-tools-extra/clangd/FormattedString.cpp
103 ↗(On Diff #250027)

We can separate concerns more clearly by not having this function care about markdown.
Can it have a signature like:

  • vector<StringRef> splitLines(StringRef Text) (which wouldn't normalize whitespace within the lines)
  • vector<string> splitLines(StringRef Text) which would

and then continue to have line rendering handled elsewhere?

lolleko updated this revision to Diff 251334.Mar 19 2020, 4:08 AM
lolleko edited the summary of this revision. (Show Details)

Addressed review.

Paragraphs (lines separated by a double line break \n\n) are currently not retained. Because the current implementation of markup::Paragraph only appends a single white space after it's contents (https://github.com/llvm/llvm-project/blob/e26e9ba288ce156b851504ebbb7d8a775572957c/clang-tools-extra/clangd/FormattedString.cpp#L368).
I think this is semantically wrong because in natural language aswell as markup languages like md, html, ... a paragraph should be followed by a blank line.
But since a lot of code relies on that single white space paragraph, this should be addressed in a different patch.

Thanks, this fits in a lot better!
I think there's some more details that could be handled, e.g. if a line is much shorter than the max line length, then that probably implies it's a "real" break. But if we get the code structure right, these should be easy to plug in later.

Because the current implementation of markup::Paragraph only appends a single white space after it's contents. I think this is semantically wrong because in natural language aswell as markup languages like md, html, ... a paragraph should be followed by a blank line

Yeah, we can look at this again, but we shouldn't do so in this patch. You're right on the semantics, but real implementations (particularly VSCode) use a distasteful amount of whitespace around HTML paragraphs vs other UI elements and don't give us any control over the CSS. (Similarly rendering paragraph breaks as \n\n in plaintext feels much too heavyweight in terminal-based editors in my experience)

BTW, it's really useful to upload diffs with full file context, I think -U99999 will do this, though I tend to use arc diff which always includes it.

clang-tools-extra/clangd/Hover.cpp
524

I think we're going to end up using this elsewhere, e.g. in code completion.
Fine to keep it in Hover for now, but please expose it from hover.h and test it directly (rather than via HoverInfo::present())

This will also allow moving this implementation to the end of the file, rather than interleaving it with unrelated parts of hover logic.

527

nit: input should be StringRef since you never mutate it

529

nit: we'd generally use static/anon-namespace functions rather than lambdas for helpers when we don't need to capture much stuff.

583

I think this could be easier to read by separating the strategy from some of the details.

From my reading of the code, the strategy is:

  • examine the raw lines from the input to determine where true paragraph breaks occur. (In general, based on the end of the previous line and the start of the new line)
  • form each groups of raw lines into a paragraph (by trimming whitespace and \, dropping blank lines, and joining the results with spaces)

Given this, I think we could extract a couple of functions like:

  • bool breakBetween(StringRef Before, StringRef After)
  • std::string joinRawLines(ArrayRef<StringRef>)

and the outer loop becomes simpler, like:

SmallVector<StringRef, N> RawLines;
llvm::SplitString(Input, RawLines, "\n");
ArrayRef Remaining = RawLines;
while (!Remaining.empty()) {
  int Para = 1;
  while (Para < Remaining.size() && !breakBetween(Remaining[Para-1], Remaining[Para]))
    Para++;
  std::string ParaText = joinRawLines(Remaining.take_front(Para));
  if (!ParaText.empty())
    Output.addParagraph().appendText(std::move(ParaText));
  Remaining = Remaining.drop_front(Para);
}

I think this signature for breakBetween in terms of stringrefs would result in the lambdas you have here being a bit more readable, and the main is clearer for not dealing directly with string, whitespace, etc.

clang-tools-extra/clangd/unittests/HoverTests.cpp
1886

@kadircet this feels like a case where having a debug output mode like [Para:foo][Para:bar] would be useful!

lolleko updated this revision to Diff 252006.Mar 23 2020, 6:05 AM

Adressed 2nd Review

The problem with the changes you proposed is that it doesn't handle paragraphs (\n\n).
I think the conditionals required to make that work wouldn't be much cleaner than the current solution.

And I do think parahraphs in comments should be retained. I agree that not every hard line break should be a pargraph.
But actual parahraphs should be retained e.g.:

	/**
	 * Condition for calling UpdateOverlaps() to initialize overlap state when loaded in during level streaming.
	 * If set to 'UseConfigDefault', the default specified in ini (displayed in 'DefaultUpdateOverlapsMethodDuringLevelStreaming') will be used.
	 * If overlaps are not initialized, this actor and attached components will not have an initial state of what objects are touching it,
	 * and overlap events may only come in once one of those objects update overlaps themselves (for example when moving).
	 * However if an object touching it *does* initialize state, both objects will know about their touching state with each other.
	 * This can be a potentially large performance savings during level streaming, and is safe if the object and others initially
	 * overlapping it do not need the overlap state because they will not trigger overlap notifications.
	 * 
	 * Note that if 'bGenerateOverlapEventsDuringLevelStreaming' is true, overlaps are always updated in this case, but that flag
	 * determines whether the Begin/End overlap events are triggered.
	 * 
	 * @see bGenerateOverlapEventsDuringLevelStreaming, DefaultUpdateOverlapsMethodDuringLevelStreaming, GetUpdateOverlapsMethodDuringLevelStreaming()
	 */

I think once proper paragraphs are in. The linebreak parsing could be rewritten into 2 steps:

  1. Split on \n\n to extract actual paragraphs.
  2. Use the solution you proposed to split and join new lines.
lolleko edited the summary of this revision. (Show Details)Mar 23 2020, 6:06 AM
sammccall accepted this revision.Mar 23 2020, 9:15 AM

Thanks for this! Mostly just some cleanups/comment nits left.

The substantive change is around the \ and at end-of-line. It looks like the goal is on improving the parsing of markdown-formatted comments, but we need to focus as well on non-markdown-formatted comments as these are the majority case. (Most of your patch improves these a lot!)

If you don't have commit access yet, I'm happy to land this for you, just let me know when ready.
(A couple of patches like this are good grounds to apply for commit access, you can just link to the review threads)

clang-tools-extra/clangd/Hover.cpp
530

We can make use of some StringRef helpers to make this more readable.
I think this is equivalent (to the whole function body):

return Str.substr(LineBreakIndex + 1).drop_while(isWhitespace).startswith("\n");
534

I'm not altogether happy with these rules: many comments are not markdown, and these don't seem very safe for non-markdown comments.

Comments ending in a trailing slash may be explicitly indicating a soft line break e.g.

// Example usage:
// my_prog --some_flag = foo \
//    --remote_database=/path/to/whatever \
//    input.txt

and trailing whitespace is often just sloppiness.

I did a search over a large corpus of open-source C++.

  • for //.*\ \ $: Of a random 10 results, none appear to be markdown.
  • for //.*\\$: Of a random 10 results, none were markdown. (4 were explicit soft breaks, 4 were ascii art, 2 were soft breaks in commented-out macro definitions)
535

What about:

StringRef Prev = Str.substr(0, LineBreakIndex);
return Prev.endswith("\\") || Prev.endswith("  ")

(this seems more readable, and fixes the bug if LineBreakIndex == 1 and the string is {backslash, newline, ...}.)

544

nit: Punctuation.contains(...)

733

nit: please put the doc on the declaration rather than the definition.

(I think the previous name was more appropriate to the signature, but up to you)

750

rather than repeating the code twice, please just use if isParagraph() || isHard(), with the comment that we may treat paragraph differently

751

nit: llvm tends to use "// FIXME: "

751

please soften this to "maybe".
(Because we need to consider the padding/css issues of doing so, whether a hard break vs double break reflects different intents or just different conventions in different codebases, and whether we have a reasonable way to implement this - paragraphs have leading as well as trailing padding, so it gets complicated fast).

and rephrase to "should distinguish between line breaks and paragraphs".
("Double linebreak" is confusing as it doesn't mean anything at the markup::Document level, and markdown paragraph isn't the same concept as a plaintext blank line, even if they are both encoded as "\n\n")

clang-tools-extra/clangd/unittests/HoverTests.cpp
1891

please add tests starting/ending with newline as these are special cases in the code

1943

please drop these comments, as they don't seem to be true (unless assuming the content is markdown, which it usually isn't)

I think we can drop most of the tests too (for specific characters after newline), as we already have escaping tests and these don't actually test different linebreaking behaviour.

This revision is now accepted and ready to land.Mar 23 2020, 9:15 AM
lolleko updated this revision to Diff 252151.Mar 23 2020, 3:11 PM
This comment was removed by lolleko.
lolleko updated this revision to Diff 252153.Mar 23 2020, 3:14 PM
This comment was removed by lolleko.
lolleko updated this revision to Diff 252159.EditedMar 23 2020, 3:19 PM

Final Cleanup
Removed markdown linebreak parsing for now.

No I don't have commit rights yet. Feel fre to merge this for me.

Thanks for the thorough review.

lolleko edited the summary of this revision. (Show Details)Mar 23 2020, 3:25 PM
Harbormaster completed remote builds in B50174: Diff 252159.
This revision was automatically updated to reflect the committed changes.