Details

Reviewers

sammccall
kadircet
adamcz

Summary

As @kadircet mentions in D84912#2184144, findNearbyIdentifier() traverses the whole file if there is no identifier for the word.
This patch ensures that we give up after 2^N lines in any case.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ArcsinX created this revision. Sep 18 2020, 1:55 AM

Herald added a project: Restricted Project. Sep 18 2020, 1:55 AM

Herald added subscribers: cfe-commits, usaxena95, arphaman, jkorous.

ArcsinX requested review of this revision. Sep 18 2020, 1:55 AM

Herald added subscribers: MaskRay, ilya-biryukov. Sep 18 2020, 1:55 AM

ArcsinX added reviewers: sammccall, kadircet, adamcz. Sep 18 2020, 1:56 AM

Fix format

Harbormaster completed remote builds in B72153: Diff 292739. Sep 18 2020, 3:44 AM

std::pow => bitwise shift.
Take care of integer overflow.

Harbormaster completed remote builds in B72630: Diff 293665. Sep 23 2020, 1:28 AM

I feel like I'm doing something totally wrong here :)
Could someone give me some advice?

Hey! Sorry for the late reply; this has been open in my tabs since day 1, I just didn't get a chance to take a look at it.

The biggest problem I see is that this does not change the fact that we are still traversing the whole file:

You do one traversal over the whole file to find FileLines
Then possibly two more to find OffsetMin and OffsetMax

You can get rid of the first one by just using getLocForEndOfFile if positionToOffset fails.
For the latter, you can use SourceManager::translateLineCol instead; it uses a cache and is cheaper than the linear scan performed by positionToOffset in case of multiple calls. The cache is already populated once we issue the first Cost call, through SM.getSpellingLineNumber.

I was reluctant overall as the wins aren't clear. We are still going to traverse the whole file at least once, and readability is hindered, but I suppose with the two tricks mentioned above, runtime shouldn't regress.
Also it is nice to get rid of 2 log calls for each token. Again, all of this feels like micro-optimization that isn't justified.
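
To illustrate why a cached line table makes repeated line-to-offset queries cheap, here is a minimal sketch. `LineTable`, `lineStart` and `numLines` are hypothetical names invented for this example; this is not clang's SourceManager implementation, only the general idea of paying for one scan up front and answering later queries by lookup. Clamping past-the-end line numbers to the end of the buffer is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a cached line table: one linear scan at
// construction, then constant-time lookups for every later query.
class LineTable {
public:
  explicit LineTable(const std::string &Text) : Size(Text.size()) {
    Starts.push_back(0); // line 1 always starts at offset 0
    for (std::size_t I = 0; I < Text.size(); ++I)
      if (Text[I] == '\n')
        Starts.push_back(I + 1);
  }

  // Offset of the first character of a one-based line.
  // Lines past the end of the buffer clamp to the buffer size.
  std::size_t lineStart(unsigned Line) const {
    assert(Line >= 1 && "line numbers are one-based");
    if (Line > Starts.size())
      return Size;
    return Starts[Line - 1];
  }

  std::size_t numLines() const { return Starts.size(); }

private:
  std::vector<std::size_t> Starts; // start offset of each line
  std::size_t Size;                // total buffer size in bytes
};
```

With a table like this, the lower and upper bounds of a search window cost two lookups instead of two more scans of the file.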

sammccall added inline comments. Sep 23 2020, 11:52 PM

clang-tools-extra/clangd/XRefs.cpp
564–565	Since costs are only compared, we can simplify this: return (Line >= WordLine) ? (Line - WordLine) : 2 * (WordLine - Line)
564–565	this has changed the relative ordering, if we're dropping the log then +1 should now become multiplication by two. (Or was this intended?)
568–586	The initial value of BestCost was chosen so that no-result would be worse than any cost in the eligible range, but better than any cost outside it. If you're going to truncate the search region, you can just initialize to -1 (i.e. max). (FWIW, I guess `sizeof(unsigned) * 8 - 1` -> `std::numeric_limits<unsigned>::digits - 1` would be more obvious)
570	I think simplest is to use SourceMgr's line table.
	int MinLine = (signed)Line - Word.Text.size()/2, MaxLine = Line + Word.Text.size();
	SourceLocation LocMin = SM.translateLineCol(File, std::max(1, MinLine), 1);
	SourceLocation LocMax = SM.translateLineCol(File, MaxLine); // past-end ok
580	why can't this happen anymore?
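
The remark about relative ordering on lines 564–565 can be checked with a small standalone sketch. `floorLog2`, `oldCost` and `newCost` are names made up for this example (`floorLog2` stands in for `llvm::Log2_64`); the two cost functions follow the formulas quoted in this review. In the logarithmic version a backward match at distance d costs the same as a forward match at distance 2d, so once the log is dropped the "+1" penalty becomes a factor of two.

```cpp
#include <cassert>
#include <cstdint>

// Floor of log2 for a nonzero value; a stand-in for llvm::Log2_64.
inline unsigned floorLog2(std::uint64_t V) {
  assert(V != 0 && "log2 of zero is undefined");
  unsigned Result = 0;
  while (V >>= 1)
    ++Result;
  return Result;
}

// Cost before the patch: logarithmic buckets, backwards is one bucket worse.
inline unsigned oldCost(unsigned Line, unsigned WordLine) {
  if (Line > WordLine)
    return 1 + floorLog2(Line - WordLine);
  if (Line < WordLine)
    return 2 + floorLog2(WordLine - Line);
  return 0;
}

// Cost suggested in the review: linear, backward distance counts double.
inline unsigned newCost(unsigned Line, unsigned WordLine) {
  return Line >= WordLine ? Line - WordLine : 2 * (WordLine - Line);
}
```

For a word on line 10, `oldCost` gives candidates on lines 12 and 13 the same cost, while `newCost` separates them; one line backward is as expensive as two lines forward under both functions.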

In D87891#2291760, @kadircet wrote:

> The biggest problem I see is that this does not change the fact that we are still traversing the whole file:
>
> You do one traversal over the whole file to find FileLines
> Then possibly two more to find OffsetMin and OffsetMax

Seems you are right, but we do not compare strings during traversal to find FileLines.

> You can get rid of the first one by just using getLocForEndOfFile if positionToOffset fails.
> For the latter, you can use SourceManager::translateLineCol instead; it uses a cache and is cheaper than the linear scan performed by positionToOffset in case of multiple calls. The cache is already populated once we issue the first Cost call, through SM.getSpellingLineNumber.
>
> I was reluctant overall as the wins aren't clear. We are still going to traverse the whole file at least once, and readability is hindered, but I suppose with the two tricks mentioned above, runtime shouldn't regress.
> Also it is nice to get rid of 2 log calls for each token. Again, all of this feels like micro-optimization that isn't justified.

Thanks for your reply, I will rethink this patch.

clang-tools-extra/clangd/XRefs.cpp
564–565	Yes, you are right, my fault here. But should we penalize backwards so hard?
580	As I understand, we call `findNearbyIdentifier()` only when the word is not an identifier. And `Tok` is always identifier here. Btw, even if `findNearbyIdentifier()` could be called for an identifier, I could not get why it's bad to return current position here. Am I wrong?

In D87891#2291838, @ArcsinX wrote:

> Thanks for your reply, I will rethink this patch.

FWIW I think if we drop most of the math in favor of using SourceManager's line translation, this doesn't add much complexity and is probably a bit more direct and efficient.

clang-tools-extra/clangd/XRefs.cpp
580	> As I understand, we call findNearbyIdentifier() only when the word is not an identifier
	I'm not sure where you're getting that. If you mean the bail-out inside the function itself:
	    // Don't use heuristics if this is a real identifier.
	    // Unlikely identifiers are OK if they were used as identifiers nearby.
	    if (Word.ExpandedToken)
	      return nullptr;
	then "real identifier" here means it's an identifier after preprocessing. We're still getting here if it's e.g. part of a macro body, or code that's `#ifdef`'d out. The spelled token in those cases is an identifier, there's just no corresponding expanded token. (I'm surprised we don't have a test covering this though!)
	> Btw, even if findNearbyIdentifier() could be called for an identifier, I could not get why it's bad to return current position here.
	The point of this function is that this occurrence of the word can't be resolved, so let's try another one nearby. If we just return the same occurrence, then we're certainly not going to be able to resolve it.
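
As an illustration of the two cases named above, consider the following hypothetical snippet (`widget`, `USE_WIDGET` and `useWidget` are invented for this example and do not come from the clangd tests). The occurrences of `widget` in the macro body and in the disabled block are spelled as identifiers in the file, but, as far as I understand the token model described here, they have no expanded token of their own.

```cpp
#include <cassert>

int widget = 1;

// The 'widget' below is spelled inside a macro body: it is an identifier in
// the file text, but it only reaches the parser through an expansion.
#define USE_WIDGET() (widget + 1)

#if 0
// Disabled code: 'widget' here is spelled but never reaches the parser.
int disabled = widget;
#endif

inline int useWidget() {
  int Result = USE_WIDGET(); // the macro body is expanded at this call site
  assert(Result == widget + 1);
  return Result;
}
```

Placing the cursor on either of those spelled-only occurrences is the situation in which a nearby-identifier heuristic would be consulted, and returning the same occurrence would not help.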

ArcsinX added inline comments. Sep 24 2020, 1:30 AM

clang-tools-extra/clangd/XRefs.cpp
580	Thanks for your clarification. I will revert this change (and will try to add a test as well).

Use SourceManager::translateLineCol(), code simplifications.

Harbormaster completed remote builds in B73126: Diff 294616. Sep 28 2020, 12:31 AM

ArcsinX added inline comments. Sep 28 2020, 12:31 AM

clang-tools-extra/clangd/XRefs.cpp
580	I failed to create a test for this. I can create a test for identifiers in a macro body or under `#ifdef`, but such a test passes without `if (Tok.location() == Word.Location)` because of this check:
	if (!(tokenSpelledAt(Tok.location(), TB) || TB.expansionStartingAt(&Tok)))
	  return false;
589	here

ArcsinX marked 5 inline comments as done. Sep 28 2020, 11:59 PM

sammccall accepted this revision. Sep 29 2020, 12:50 AM

sammccall added inline comments.

clang-tools-extra/clangd/XRefs.cpp
572	std::min(Word.Text.size(), std::numeric_limits<unsigned>::digits - 1) to avoid UB :-(
572	name WordGain is unclear to me: MaxDistance?
574	cares about -> can handle?
578	I think this is backwards: min should divide wordgain by two, max should not?
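
The clamped shift and the asymmetric line bounds discussed in these comments can be tried in isolation. The following is a minimal sketch, not the committed code: `LineWindow` and `searchWindow` are hypothetical names, and the arithmetic follows the `LineMin` and `LineMax` formulas stated in the diff comments of this revision (backward reach is half the forward reach).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper type: inclusive one-based line range to search.
struct LineWindow {
  unsigned Min;
  unsigned Max;
};

inline LineWindow searchWindow(unsigned WordLine, std::size_t WordLen) {
  assert(WordLine >= 1 && "line numbers are one-based");
  // Shifting by the full width of the type (or more) is undefined
  // behaviour, so the exponent is clamped to digits - 1.
  unsigned Shift = static_cast<unsigned>(std::min<std::size_t>(
      WordLen, std::numeric_limits<unsigned>::digits - 1));
  unsigned MaxDistance = 1U << Shift; // 2^N lines forward
  unsigned Back = MaxDistance / 2;    // 2^(N-1) lines backward
  // Avoid unsigned wrap-around when the word is near the top of the file.
  unsigned Min = WordLine + 1 <= Back ? 1 : WordLine + 1 - Back;
  unsigned Max = WordLine + 1 + MaxDistance;
  return {Min, Max};
}
```

For a two-character word on line 10 this yields the range 9 to 15; a very long word near the top of the file simply gets a lower bound of 1.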

This revision is now accepted and ready to land. Sep 29 2020, 12:50 AM

Fix possible UB at bitwise shift.
WordGain => MaxDistance.
Fix LineMin and LineMax values.
Fix comment.

ArcsinX marked 4 inline comments as done. Sep 29 2020, 3:30 AM

Harbormaster completed remote builds in B73308: Diff 294924. Sep 29 2020, 3:35 AM

Don't know why this didn't close automatically
Commit: https://reviews.llvm.org/rGd8ba6b4ab3eceb6bbcdf4371d4ffaab9d1a5cebe

Diff 294924

clang-tools-extra/clangd/XRefs.cpp

@@ const syntax::Token *findNearbyIdentifier(const SpelledWord &Word,
   const SourceManager &SM = TB.sourceManager();
   // We prefer the closest possible token, line-wise. Backwards is penalized.
   // Ties are implicitly broken by traversal order (first-one-wins).
   auto File = SM.getFileID(Word.Location);
   unsigned WordLine = SM.getSpellingLineNumber(Word.Location);
   auto Cost = [&](SourceLocation Loc) -> unsigned {
     assert(SM.getFileID(Loc) == File && "spelled token in wrong file?");
     unsigned Line = SM.getSpellingLineNumber(Loc);
-    if (Line > WordLine)
-      return 1 + llvm::Log2_64(Line - WordLine);
-    if (Line < WordLine)
-      return 2 + llvm::Log2_64(WordLine - Line);
-    return 0;
+    return Line >= WordLine ? Line - WordLine : 2 * (WordLine - Line);
   };
   const syntax::Token *BestTok = nullptr;
-  // Search bounds are based on word length: 2^N lines forward.
-  unsigned BestCost = Word.Text.size() + 1;
+  unsigned BestCost = -1;
+  // Search bounds are based on word length:
+  // - forward: 2^N lines
+  // - backward: 2^(N-1) lines.
+  unsigned MaxDistance =
+      1U << std::min<unsigned>(Word.Text.size(),
+                               std::numeric_limits<unsigned>::digits - 1);
+  // Line number for SM.translateLineCol() should be one-based, also
+  // SM.translateLineCol() can handle line number greater than
+  // number of lines in the file.
+  // - LineMin = max(1, WordLine + 1 - 2^(N-1))
+  // - LineMax = WordLine + 1 + 2^N
+  unsigned LineMin =
+      WordLine + 1 <= MaxDistance / 2 ? 1 : WordLine + 1 - MaxDistance / 2;
+  unsigned LineMax = WordLine + 1 + MaxDistance;
+  SourceLocation LocMin = SM.translateLineCol(File, LineMin, 1);
+  assert(LocMin.isValid());
+  SourceLocation LocMax = SM.translateLineCol(File, LineMax, 1);
+  assert(LocMax.isValid());
 
   // Updates BestTok and BestCost if Tok is a good candidate.
   // May return true if the cost is too high for this token.
   auto Consider = [&](const syntax::Token &Tok) {
+    if (Tok.location() < LocMin || Tok.location() > LocMax)
+      return true; // we are too far from the word, break the outer loop.
     if (!(Tok.kind() == tok::identifier && Tok.text(SM) == Word.Text))
       return false;
     // No point guessing the same location we started with.
     if (Tok.location() == Word.Location)
       return false;
     // We've done cheap checks, compute cost so we can break the caller's loop.
     unsigned TokCost = Cost(Tok.location());
     if (TokCost >= BestCost)
       return true; // causes the outer loop to break.
     // Allow locations that might be part of the AST, and macros (even if empty)
     // but not things like disabled preprocessor sections.
     if (!(tokenSpelledAt(Tok.location(), TB) || TB.expansionStartingAt(&Tok)))
       return false;
     // We already verified this token is an improvement.
     BestCost = TokCost;
     BestTok = &Tok;
     return false;
   };
   auto SpelledTokens = TB.spelledTokens(File);
   // Find where the word occurred in the token stream, to search forward & back.

clang-tools-extra/clangd/unittests/XRefsTests.cpp

@@ const char *Tests[] = {
       )cpp",
       R"cpp(
         // short identifiers don't find far results
         int hi;
 
 
 
         // h^i
+
+
+
+
+        int x = hi;
       )cpp",
       R"cpp(
         // prefer nearest occurrence even if several matched tokens
         // have the same value of `floor(log2(<token line> - <word line>))`.
         int hello;
         int x = hello, y = hello;
         int z = [[hello]];
         // h^ello

This is an archive of the discontinued LLVM Phabricator instance.

[clangd] findNearbyIdentifier(): guaranteed to give up after 2^N lines
Closed · Public
