This is an archive of the discontinued LLVM Phabricator instance.

[AST] Added a helper to extract a user-friendly text of a comment.
ClosedPublic

Authored by ilya-biryukov on Apr 24 2018, 4:52 AM.

Details

Summary

The helper is used in clangd for documentation shown in code completion
and storing the docs in the symbols. See D45999.

This patch reuses the code of the Doxygen comment lexer, disabling the
bits that do command and html tag parsing.
The new helper works on all comments, including non-doxygen comments
and is faster. However, it does not understand or transform any
doxygen directives, i.e. cannot extract brief text, etc.

Diff Detail

Repository
rL LLVM

Event Timeline

ilya-biryukov created this revision.Apr 24 2018, 4:52 AM

Added forgotten bits of the change

Overall looks good. Could you add tests for the new methods?

lib/AST/CommentLexer.cpp
294 ↗(On Diff #143881)

micro-nit: I'd probably

return ParseCommands ? lexWithCommands(T) : lexWithoutCommands(T);
471 ↗(On Diff #143881)

Can we share code with lexCommentTextWithCommands for these two common cases?

lib/AST/RawCommentList.cpp
353 ↗(On Diff #143881)

nit: SkipWhitespaces for readability?

380 ↗(On Diff #143881)

Explain when this would be invalid and why TokColumn = 0 is used?

383 ↗(On Diff #143881)

nit: unsigned MaxSkip = IsFirstLine ? ... : ...;

392 ↗(On Diff #143881)

I'd probably make SkipWs return the number of white spaces skipped and do the drop-front here, so that you could simplify the awkward calculation of IndentColumn below.

Overall looks good. Could you add tests for the new methods?

Sure. There are a few tests in D46002, but I haven't (yet) moved them to clang.

ilya-biryukov marked an inline comment as done.
  • Attempt to reuse lexing code with/without command parsing.
  • Get rid of SkipWs.
lib/AST/CommentLexer.cpp
471 ↗(On Diff #143881)

I couldn't come up with a way to do that previsouly.
Made another attempt which seems to work.
Please take a look, the change is somewhat non-trivial (includes removing the loop that seems redundant)

lib/AST/RawCommentList.cpp
380 ↗(On Diff #143881)

I don't know whether this can be even be invalid, but I'm not confident enough to add an assert there.
TokColumn = 0 seems like a reasonable way to recover if we can't compute the column number, i.e. assume the line starts at the first column if SourceLocation of the line was invalid for any reason.

This whole column thing looks weird to me, maybe I should just remove it altogether and just remove the same amount of whitespace in all the lines. WDYT?

383 ↗(On Diff #143881)

That would force to get rid of the comments in the if branches, but they seem to be useful.
Am I missing an obvious style that would preserve the comments?

392 ↗(On Diff #143881)

Got rid of it altogether.
The code seems is clearer now, thanks for the suggestion!

  • Update a comment after latest changes
  • Fix indentation
ilya-biryukov added inline comments.Apr 25 2018, 8:28 AM
lib/AST/RawCommentList.cpp
380 ↗(On Diff #143881)

On a second thought, now I remember why I added this in the first place.
To support the following example we want to take column numbers into account:

class Foo {
    /* A block comment
       spanning multiple lines
       has too many spaces on the all lines except the first one.
    */
    int func();
};
ioeric added inline comments.Apr 25 2018, 11:39 AM
include/clang/AST/RawCommentList.h
118 ↗(On Diff #143930)

I'm trying to understand how these cases and RawComment work.

For this case, are the // ... block and /* ... */ merged in one RawComment by default?

126 ↗(On Diff #143930)

Are the *s in each lines automatically consumed by the lexer?

lib/AST/CommentLexer.cpp
301 ↗(On Diff #143930)

I think we could avoid the "somewhat non-trivial" control flow by merging command and command-less cases in one function:

Something like:

const char *TokenPtr = ...;
auto HandleNonCommandToken = ...;
if (!ParseComand) {
   HandleNonCommandToken(...);
   return;
}
... 
switch (...) {
 // after all command cases 
default:
   HandleNonCommandToken(...);
 }
331 ↗(On Diff #143930)

We should be extra careful about removing the loop... (It does seem to be redundant though)

lib/AST/RawCommentList.cpp
343 ↗(On Diff #143930)

nit: We don't really know if there was a failure or the comment is simply empty. So I'd probably leave out the comment here to avoid confusion.

349 ↗(On Diff #143930)

s/ParseCommentText/ParseCommands/?

352 ↗(On Diff #143930)

This variable could use a comment.

393 ↗(On Diff #143930)

I think MaxSkip could be removed with:

llvm::StringRef Trimmed = TokText.drop_front(IsFirstLine ? WhitespaceLen : 
                                                 std::max((int)IndentColumn - (int)TokColumn, 0)));
380 ↗(On Diff #143881)

So I think you would want to force MaxSkip to 0 if token loc is invalid to make sure no comment is accidentally eaten?

383 ↗(On Diff #143881)

You could simply merge the comments, which doesn't seem to compromise readability.

ilya-biryukov marked 4 inline comments as done.Apr 26 2018, 2:32 AM
ilya-biryukov added inline comments.
include/clang/AST/RawCommentList.h
118 ↗(On Diff #143930)

Yes, RawComment can represent multiple merged comments of different styles.

126 ↗(On Diff #143930)

Yes, see comments::Lexer::skipLineStartingDecorations()

lib/AST/RawCommentList.cpp
343 ↗(On Diff #143930)

getRawTextSlow returns empty string on error, so this comment has some ground.
Removed it anyway

349 ↗(On Diff #143930)

Thanks for catching this!

393 ↗(On Diff #143930)

I've removed MaxSkip, but introduced SkipLen instead that captures an actual numbers of chars we want to skip.
I would still keep as a separate variable, as the expression seems somewhat complex and giving a name to it, arguably, makes the code more readable.

380 ↗(On Diff #143881)

Makes sense, thanks. Unfortunately I don't have any ideas on how we can test this case :-(

ilya-biryukov marked 2 inline comments as done.
  • Remove tryLexCommands(), call into helper that parses commands directly
  • Addressed other review comments

Looks good. We still need tests though :)

lib/AST/RawCommentList.cpp
376 ↗(On Diff #144083)

This is a bit confusing... Could you please add comments about the behavior here (as chatted offline)?

394 ↗(On Diff #144083)

use static_cast instead of conversions.

405 ↗(On Diff #144083)

I think it's end of file?

ilya-biryukov marked 3 inline comments as done.
  • Move unit tests from clangd code to AST tests
  • Assert locations are valid
  • Address review other comments
ilya-biryukov added inline comments.May 8 2018, 6:43 AM
lib/AST/RawCommentList.cpp
376 ↗(On Diff #144083)

After thinking about it for a while, decided to add an assert that location was valid instead.
Invalid locations don't make any sense there, since we won't be able to get the comment text in case of invalid locs.

394 ↗(On Diff #144083)

Done. Rewrote the code to keep the reduce the number of casts too. It was unreadable with 3 static casts and default formatting.

  • Fixed infinite loop with comments that contain doxygen commands
ioeric added a comment.May 9 2018, 2:14 PM

Thanks for adding the tests!

include/clang/AST/RawCommentList.h
138 ↗(On Diff #145691)

I think we can get rid of the interface that takes ASTContext? If SourceManager and Diags are sufficient, I don't see why we would want another interface for ASTContext.

lib/AST/RawCommentList.cpp
352 ↗(On Diff #145691)

I'm not quite sure about this. Could we just require a CommandTraits in the interface? And only make this assumption in tests?

unittests/AST/CommentTextTest.cpp
32 ↗(On Diff #145691)

SourceManagerForFile added in D46176 should save you a few lines here. (I'm landing it right now...)

ilya-biryukov marked an inline comment as done.
  • Simplify test code with SourceManagerForFile.
ilya-biryukov added inline comments.May 11 2018, 7:35 AM
include/clang/AST/RawCommentList.h
138 ↗(On Diff #145691)

Two reasons that come to mind: it's simpler to use and follows the API of getBriefText. If not for mocking the tests, I would totally only keep the ASTContext-based one, since it does not really make any sense to create RawComment without ASTContext for any reason other than testing.

lib/AST/RawCommentList.cpp
352 ↗(On Diff #145691)

I think we shouldn't add this to params, the whole point of this function is to do parsing that ignores the commands and the CommandTraits. The fact that lexer still needs them is because we haven't extracted a simpler interface from Lexer that doesn't rely on unused params, i.e. CommandTraits and Allocator.

unittests/AST/CommentTextTest.cpp
32 ↗(On Diff #145691)

Thanks!

  • Removed the overload that accepts ASTContext
ilya-biryukov marked 2 inline comments as done.May 15 2018, 1:40 AM
ioeric accepted this revision.May 15 2018, 1:41 AM

lgtm

include/clang/AST/RawCommentList.h
138 ↗(On Diff #145691)

As discussed offline, two interfaces that do the same thing are a bit confusing. I would still suggest favoring minimal API and test-ability over simplified usage - two parameters aren't that better than one after all :)

lib/AST/RawCommentList.cpp
352 ↗(On Diff #145691)

Makes sense.

This revision is now accepted and ready to land.May 15 2018, 1:41 AM
This revision was automatically updated to reflect the committed changes.