This is an archive of the discontinued LLVM Phabricator instance.

[clangd] findNearbyIdentifier(): guaranteed to give up after 2^N lines
ClosedPublic

Authored by ArcsinX on Sep 18 2020, 1:55 AM.

Details

Summary

As @kadircet mentions in D84912#2184144, findNearbyIdentifier() traverses the whole file if there is no identifier for the word.
This patch ensures give up after 2^N lines in any case.

Diff Detail

Event Timeline

ArcsinX created this revision.Sep 18 2020, 1:55 AM
ArcsinX requested review of this revision.Sep 18 2020, 1:55 AM
ArcsinX updated this revision to Diff 292739.Sep 18 2020, 3:09 AM

Fix format

ArcsinX updated this revision to Diff 293665.Sep 23 2020, 12:56 AM

std::pow => bitwise shift.
Take care about integers overflow.

I feel like I'm doing something totally wrong here :)
Could someone give me an advice?

Hey! Sorry for the late reply, this has been open in my tabs since day 1 just didn't get a chance to take a look at it.

The biggest problem I see is, this is not changing the fact that we are still traversing the whole file:

  • You do one traversal over the whole file to find FileLines
  • Then possibly two more to find OffsetMin and OffsetMax

You can get rid of the first one by just using getLocForEndOfFile if positionToOffset fails.
For the latter, you can use SourceManager::translateLineCol instead, it uses a cache and cheaper than the linear scan performed by positionToOffset in case of multiple calls. The cache is already populated once we issue the first Cost call, through SM.getSpellingLineNumber

I was reluctant overall as the wins aren't clear. We are still going to traverse the whole file at least once, and readability is hindered but I suppose with the above mentioned two trics, runtime shouldn't be regressed.
Also it is nice to get rid of 2 log calls for each token. Again all of these feels like some micro optimizations that aren't justified.

sammccall added inline comments.Sep 23 2020, 11:52 PM
clang-tools-extra/clangd/XRefs.cpp
564–565

Since costs are only compared, we can simplify this:

return (Line >= WordLine) ? (Line - WordLine) : 2 * (WordLine - Line)

564–565

this has changed the relative ordering, if we're dropping the log then +1 should now become multiplication by two.
(Or was this intended?)

568–586

The initial value of BestCost was chosen so that no-result would be worse than any cost in the eligible range, but better than any cost outside it.
If you're going to truncate the search region, you can just initialize to -1 (i.e. max).

(FWIW, I guess sizeof(unsigned) * 8 - 1 -> std::numeric_limits<unsigned>::digits - 1 would be more obvious)

570

I think simplest is to use SourceMgr's line table.

int MinLine = (signed)Line - Word.Text.size()/2 , MaxLine = Line + Word.Text.size();
SourceLocation LocMin = SM.translateLineCol(File, std::max(1, MinLine), 1);
SourceLocation LocMax = SM.translateLineCol(File, MaxLine); // past-end ok
580

why can't this happen anymore?

The biggest problem I see is, this is not changing the fact that we are still traversing the whole file:

  • You do one traversal over the whole file to find FileLines
  • Then possibly two more to find OffsetMin and OffsetMax

Seems you are right, but we do not compare strings during traversal to find FileLines.

You can get rid of the first one by just using getLocForEndOfFile if positionToOffset fails.
For the latter, you can use SourceManager::translateLineCol instead, it uses a cache and cheaper than the linear scan performed by positionToOffset in case of multiple calls. The cache is already populated once we issue the first Cost call, through SM.getSpellingLineNumber

I was reluctant overall as the wins aren't clear. We are still going to traverse the whole file at least once, and readability is hindered but I suppose with the above mentioned two trics, runtime shouldn't be regressed.
Also it is nice to get rid of 2 log calls for each token. Again all of these feels like some micro optimizations that aren't justified.

Thanks for your reply, I will rethink this patch.

clang-tools-extra/clangd/XRefs.cpp
564–565

Yes, you are right, my fault here.

But should we penalize backwards so hard?

580

As I understand, we call findNearbyIdentifier() only when the word is not an identifier. And Tok is always identifier here.

Btw, even if findNearbyIdentifier() could be called for an identifier, I could not get why it's bad to return current position here.

Am I wrong?

Thanks for your reply, I will rethink this patch.

FWIW I think if we drop most of the math in favor of using SourceManager's line translation, this doesn't add much complexity and is probably a bit more direct and efficient.

clang-tools-extra/clangd/XRefs.cpp
580

As I understand, we call findNearbyIdentifier() only when the word is not an identifier

I'm not sure where you're getting that.
If you mean the bail-out inside the function itself:

// Don't use heuristics if this is a real identifier.
// Unlikely identifiers are OK if they were used as identifiers nearby.
if (Word.ExpandedToken)
  return nullptr;

then "real identifier" here means it's an identifier after preprocessing.
We're still getting here if it's e.g. part of a macro body, or code that's #ifdef'd out.
The spelled token in those cases is an identifier, there's just no corresponding expanded token.

(I'm surprised we don't have a test covering this though!)

Btw, even if findNearbyIdentifier() could be called for an identifier, I could not get why it's bad to return current position here.

The point of this function is that *this* occurrence of the word can't be resolved, so let's try another one nearby.
If we just return the same occurrence, then we're certainly not going to be able to resolve it.

ArcsinX added inline comments.Sep 24 2020, 1:30 AM
clang-tools-extra/clangd/XRefs.cpp
580

Thanks for your clarification. I will revert this change (and will try to add a test as well).

ArcsinX updated this revision to Diff 294616.Sep 28 2020, 12:24 AM

Use SourceManager::translateLineCol(), code simplifications.

ArcsinX added inline comments.Sep 28 2020, 12:31 AM
clang-tools-extra/clangd/XRefs.cpp
580

I failed to create a test for this.
I can create a test for identifiers in a macro body or under #ifdef, but such a test passes without if (Tok.location() == Word.Location) because of this check:

if (!(tokenSpelledAt(Tok.location(), TB) || TB.expansionStartingAt(&Tok)))
  return false;
589

here

ArcsinX marked 5 inline comments as done.Sep 28 2020, 11:59 PM
sammccall accepted this revision.Sep 29 2020, 12:50 AM
sammccall added inline comments.
clang-tools-extra/clangd/XRefs.cpp
572

std::min(Word.Text.size(), numeric_limits<unsigned>::digits() - 1) to avoid UB :-(

572

name WordGain is unclear to me: MaxDistance?

574

cares about -> can handle?

578

I think this is backwards: min should divide wordgain by two, max should not?

This revision is now accepted and ready to land.Sep 29 2020, 12:50 AM
ArcsinX updated this revision to Diff 294924.Sep 29 2020, 3:27 AM

Fix possible UB at bitwise shift.
WordGain => MaxDistance.
Fix LineMin and LineMax values.
Fix comment.

ArcsinX marked 4 inline comments as done.Sep 29 2020, 3:30 AM
ArcsinX closed this revision.Sep 29 2020, 10:24 AM

Don't know why this didn't close automatically
Commit: https://reviews.llvm.org/rGd8ba6b4ab3eceb6bbcdf4371d4ffaab9d1a5cebe