This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Ignore semicolons, whitespace, and comments in SelectionTree.
ClosedPublic

Authored by sammccall on Jul 30 2019, 3:50 PM.

Details

Summary

Whitespace and comments are a clear bugfix: selecting some
comments/space near a statement doesn't mean you're selecting the
surrounding block.

Semicolons are less obvious, but for similar reasons: these tokens
aren't actually claimed by any AST node (usually), so an AST-based model
like SelectionTree shouldn't take them into account.

Callers may still sometimes care about semis of course:

  • when the selection is an expr with a non-expr parent, selection of the semicolon indicates intent to select the statement.
  • when a statement with a trailing semi is selected, we need to know its range to ensure it can be removed.

SelectionTree may or may not play a role here, but these are separate questions
from its core function of describing which AST nodes were selected.

The mechanism here is the TokenBuffer from syntax-trees. We use it in a
fairly low-level way (just to get boundaries of raw spelled tokens). The
actual mapping of AST nodes to coordinates continues to use the (fairly
mature) SourceLocation based logic. TokenBuffer/Syntax trees
don't currently offer an alternative to getFileRange(), I think.

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision.Jul 30 2019, 3:50 PM

It occurs to me that claim is O(node_tokens + log total_tokens) which is bad when the nodes are large.

Indeed for an input like namespace { namespace { namespace { ... } } } time is quadratic.

I think this is probably fine in practice. Against adversarial input clang certainly can take exponential time, and easily crash too.

If we do need to fix it my best idea is to give each "uninteresting" TokInfo (that is, !selected || claimed) a pointer to the next TokInfo with different flags. This would allow iteration to quickly skip over contiguous ranges one they had been traversed once.

sammccall updated this revision to Diff 212536.Jul 31 2019, 1:58 AM

Improve comments in SelectedTokens.

kadircet accepted this revision.Jul 31 2019, 6:19 AM

Thanks for the comments, LGTM!

This revision is now accepted and ready to land.Jul 31 2019, 6:19 AM
SureYeaah added inline comments.Jul 31 2019, 7:14 AM
clang-tools-extra/clangd/Selection.cpp
50 ↗(On Diff #212536)

Would this work correctly for nested templates? Or do we need to use the specialized token length function that we use for toHalfOpenFileRange?

sammccall marked 2 inline comments as done.Jul 31 2019, 10:45 AM
sammccall added inline comments.
clang-tools-extra/clangd/Selection.cpp
50 ↗(On Diff #212536)

This uses the token list as an intermediary for matching selected chars with AST nodes. TokenBuffer will indeed by default lex >> as a right shift. So we're buggy here, but I think it *mostly* doesn't matter.

If it's a double template:

  • the innermost template will claim it first, if the template range touches the selection. Problem: if only the first > is selected, the inner template will only be partially selected.
  • the outermost template will not get to claim it at all (if the inner template range touches the selection). Problem: if this is the *only* part of the outer template that's selected, it will be marked unselected. (This should be rare)
  • If the inner template doesn't touch the selection, then the outer template will be selected but only partially, which is actually correct.

Examples:

a<b<c>>
     ~  b=partial (correct)
  ~~~~  b=partial (incorrect: b=complete) <-- this is the worst case
~~~~~~~ a=complete,b=complete (correct)
      ~ a=partial (correct)
     ~~ b=partial (incorrect: a=partial,b=partial)

I'll send a followup to fix this case tomorrow (I think we can just always split the token in half) but I don't think it's critical.

This revision was automatically updated to reflect the committed changes.
sammccall marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2019, 10:54 AM