This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add doxygen parsing for Hover [1/3]
Needs ReviewPublic

Authored by tom-anders on Sep 18 2022, 4:39 AM.

Details

Summary

1/3: Add SymbolDocumentation class to parse Doxygen comments

This commit just adds and tests the a new class for doxygen parsing.
Consumption in Hover and Index will be added in a follow-up commit.

Diff Detail

Event Timeline

tom-anders created this revision.Sep 18 2022, 4:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 18 2022, 4:39 AM
tom-anders published this revision for review.Sep 18 2022, 4:41 AM
tom-anders added reviewers: nridge, sammccall, kadircet.
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptSep 18 2022, 4:41 AM

Hi!

Sorry for letting these series of patches sit around without any comments. We were having some discussions internally to both understand the value proposition and its implications on the infrastructure.
So it'd help a lot if you can provide some more information for use cases, that way we can find a nice scope for this functionality to make sure it provides most of the value and doesn't introduce an unnecessary complexity into rest of the infrastructure and also we should try not to regress indexing of projects that don't have doxygen comments.

So first of all, what are the exact use cases you're planning to address/improve with support for doxygen parsing of comments? Couple that comes to mind:

  • obtaining docs about params & return value
  • stripping doxygen commands
  • treating brief/detail/warning/note differently
  • formatting text within comments (bold etc)
  • getting linebreaks/indent right clangd#1040

any other use cases that you believe are important?

as you might've noticed, this list already talks about dealing with certain doxygen commands (but not all).
that list is gathered by counting occurrences of those commands in a codebase with lots of open-source third_party code. findings and some initial ideas look like this:

  • \brief: ~70k occurrences
    • common but usefulness in practice is unclear
    • can infer for non-doxy too (e.g. first sentence of a regular documentation)
    • maybe just strip (or merge into regular documentation)?
  • \return[s]: 30k occurrences
    • unclear if worth separation in hover, because it might be tied to rest of the documentation (re-ordering concerns)
    • can infer for non-doxy maybe?
    • probably just strip the command and keep the rest as-is.
  • \param: 28k occurrences
    • useful for signature help. maybe hover on func calls
    • probably worth storing in a structured manner.
  • \detail[s]: 2k
  • \p: 20k
  • \code: 1k
  • \warning: 2k
  • \note: 9k
    • (for all of the above) just introduce as formatted text?

what do you think about those conclusions? any other commands that you seem worth giving a thought?
One important concern we've noticed around this part is, re-ordering comment components might actually hinder readability. as the comments are usually written with the assumption that they'll be read from top to bottom, but if we re-order them during presentation (e.g. hover has its own layout) we might start referring to concepts/entities in documentation before they're introduced. so we believe it's important to avoid any sort of re-ordering. this is one of the big down sides for preserving parameter comments in a structured way.

another important thing to consider is trying to heuristically infer some of these fields for non-doxygen code bases as well. that way we can provide a similar experience for both.

some other things to discuss about the design overall:

  • How to store the extra information?
    • Proposal from our side would be to introduce structured storage for the pieces we want (limited), and keep the rest as part of main documentation text while doing stripping/reformatting.
  • What to use as a parser?
    • Clang's doxygen parser actually looks like a great piece of code to re-use, it's unfortunate that it can issue diagnostics, requires AST etc. It'd be great to refactor that into a state where we can use it without any AST or diagnostics, and a minimal SourceManager (this seems feasible to achieve at first glance, as most of these inputs seem to be optional or used in 1 or 2 places).
    • we still need to make sure performance and behaviour on non-doxygen is reasonable though. do you have any numbers here?
  • How to store in the index?
    • If we can strip the parser off the dependencies on an astcontext, diagnostics etc. the best option would be to just store as raw text and run the whole pipeline on demand (e.g. do the doxygen parsing and markdown-ization afterwards). This is the simplest approach as it keeps index interfaces the same.

Happy to move the discussion to some other medium as well, if you would like to have them in discourse/github etc.

Thanks for the detailed feedback! Unfortunately, I’m sick right now, so I probably won’t be able to give a detailed answer until after Christmas.

Happy to move the discussion to some other medium as well, if you would like to have them in discourse/github etc.

Yeah let’s maybe copy your comment back to the GitHub Issue and discuss over there, it seems the issue has lots of subscribers already who perhaps also have some more ideas/usecases.

Thanks for the detailed feedback! Unfortunately, I’m sick right now, so I probably won’t be able to give a detailed answer until after Christmas.

Sorry to hear that. I hope you get better soon (and definitely no rush)!

Happy to move the discussion to some other medium as well, if you would like to have them in discourse/github etc.

Yeah let’s maybe copy your comment back to the GitHub Issue and discuss over there, it seems the issue has lots of subscribers already who perhaps also have some more ideas/usecases.

posted it there, thanks!

This is a beautiful patch, it has helped me understand the hierarchy of clang comment nodes much better.

I will note that it only supports explicit briefs. That is, those for which the \brief or @brief command has appeared. But implicit briefs are also quite prevalent. An implicit brief is simply the text of the first paragraph, after removing the whitespace starting at the beginning of the FullComment, and the absence of a \brief or @brief command anywhere else in the attached javadoc.