This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Implement proximity path boosting for Dex
ClosedPublic

Authored by kbobyrev on Aug 30 2018, 3:07 AM.

Details

Summary

This patch introduces PathURI Search Token kind and utilizes it to uprank symbols which are defined in files with small distance to the directory where the fuzzy find request is coming from (e.g. files user is editing).

Diff Detail

Event Timeline

kbobyrev created this revision.Aug 30 2018, 3:07 AM
kbobyrev updated this revision to Diff 163294.Aug 30 2018, 3:41 AM

Stop query generation as soon as one valid URI scheme was found.

Some high-level comments :)

clang-tools-extra/clangd/index/dex/DexIndex.h
45 ↗(On Diff #163294)

URI schemes are property of Symbols. We shouldn't need to pass specify the schemes again. Dex can collect all possible URI schemes when building the index.

I think we could generate proximity paths for index symbols without knowing about schemes (when building the index). For example, we could canonicalize the URI (as in FileDistance.cpp), for example, by converting test:///x/y/z to /test:/x/y/z. For a query, we would first convert query proximity paths to URIs (of all possible schemes), perform the same canonicalization on URIs, and then use the canonicalized paths to compute proximity.

64 ↗(On Diff #163294)

Why are values of multipliers interesting to users? Could these be implementation details in the cpp file?

clang-tools-extra/clangd/index/dex/Token.h
99 ↗(On Diff #163294)

Note that boost != up traversal distance in most cases. We could return the distance here and let users decide what the exact boost score is.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
33 ↗(On Diff #163294)

Use unittest scheme defined in TestFS.cpp

sammccall added inline comments.Aug 30 2018, 5:05 AM
clang-tools-extra/clangd/index/dex/DexIndex.h
45 ↗(On Diff #163294)

We had an offline discussion about URIs which might be interesting.

Fetching posting lists for all URI schemes at query time seems wasteful, @ilya-biryukov pointed out that in practice each file is going to exist in some preferred URI space, and we should only compare that one.
The only obstacle to doing that is that the preference order for URI schemes is not global, each place we need it it's accepted as configuration.

Maybe we should change that: as part of registering the URI schemes, we define a preferred order. And instead of the current operation makeURI(File, scheme) we should just offer APIs like makePreferredURI(File) and makeFileURI(File).

ioeric added inline comments.Aug 30 2018, 5:49 AM
clang-tools-extra/clangd/index/dex/DexIndex.h
45 ↗(On Diff #163294)

That sounds like a good idea.

We just need to be careful that indexes do not use non-preferred schemes (e.g. standalone indexer doesn't have the preferred scheme registered). This shouldn't be a problem in practice though.

kbobyrev updated this revision to Diff 163332.Aug 30 2018, 7:59 AM
kbobyrev marked 2 inline comments as done.

Address couple of comments.

clang-tools-extra/clangd/index/dex/DexIndex.h
45 ↗(On Diff #163294)

I also thought about extracting schemes during the build stage earlier today, couldn't figure out whether it's a good thing: it's probably tiny overhead for the build stage (if we just pass schemes the server is aware of we avoid looking for new scheme: in the prefixes), but it might be worth considering the architecture (and the fact that even though the server might be aware of many schemes, some of them might not occur in the index and that would induce small overhead on the query side).

Could you please elaborate your suggestion about the canonicalize? I'm not sure I caught that: IIUC the current approach (which doesn't canonicalize URIs) should be sufficient for our needs. I'm not sure why we would want /test:/x/y/z/ over test:///x/y/z/ if the extracted tokens are test:///, test:///x/, test:///x/y/, test:///x/y/z/ (which is the case right now).

64 ↗(On Diff #163294)

Actually, my understanding is that users might want to have full access to the multipliers at some point to control the performance/quality ratio.

And it's also useful for the tests: otherwise the last one would have to hard-code number of generated symbols to ensure only boosted ones are in the returned list. It would have to be updated each time these internal multipliers are and we might update them often/make logic less clear (by allowing users to control these parameters).

What do you think?

kbobyrev updated this revision to Diff 163507.Aug 31 2018, 4:35 AM

Canonicalize URIs, slightly simplify code structure.

ioeric added inline comments.Aug 31 2018, 6:47 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
137 ↗(On Diff #163507)

Could you comment on P.second + 10 here? It sounds like a lot of boost.

157 ↗(On Diff #163507)

Again, the multiplier change seems irrelevant in this patch.

163 ↗(On Diff #163507)

Shouldn't we sort them by Quality * boost?

clang-tools-extra/clangd/index/dex/DexIndex.h
73 ↗(On Diff #163507)

Double private?

87 ↗(On Diff #163507)

I think URISchemes should be initialized when creating a DexIndex and remain the same. So this could be const without mutex guard.

64 ↗(On Diff #163294)

I'm not sure if users can usefully control this multipliers without understanding the whole implementation. Do we have a use case for these APIs now? If not, I'd suggesting removing them from this patch. It also seems to be out of the scope of this patch.

clang-tools-extra/clangd/index/dex/Token.h
54 ↗(On Diff #163507)

As this is called Path, I'd try to decouple it from URIs, so that we don't need special handling of scheme etc in the implementation. We might want to canonicalize URI to "path" like /file:/path/to/something/ so that we could simply treat it as normal paths, and the token could potentially handle actual actual file paths.

97 ↗(On Diff #163507)

I couldn't find implementations of these two functions in this patch?

kbobyrev updated this revision to Diff 163537.Aug 31 2018, 8:31 AM
kbobyrev marked 14 inline comments as done.
kbobyrev updated this revision to Diff 163538.Aug 31 2018, 8:36 AM
kbobyrev marked an inline comment as not done.

Fix tests

clang-tools-extra/clangd/index/dex/Token.h
54 ↗(On Diff #163507)

Should probably rename it to URI and be more explicit about its nature.

ioeric added inline comments.Sep 3 2018, 1:36 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
36 ↗(On Diff #163538)

nit: no need for braces. [llvm-style]

126 ↗(On Diff #163538)

I thought we wanted to take the first scheme that works, given that URISchemes is sorted by preference?

137 ↗(On Diff #163538)

PARENTS_TO_BOOST + 1 - P.second seems to be too much boosting. We should align the boosting function with the one we use in clangd/Quality.cpp, e.g. proximity score should be would be in the range [0, 1], and we can boost by 1+proximity.

161 ↗(On Diff #163538)

nit: in general, this should be a combination of relevance score (e.g. proximity) and quality score (e.g. #references).

169 ↗(On Diff #163538)

nit: SymbolQuality[(*Symbols)[LHS.first]]?

For performance reason, could we instead make SymbolQuality a vector indexed by DocID?

clang-tools-extra/clangd/index/dex/DexIndex.h
42 ↗(On Diff #163538)

nit: explicit?

42 ↗(On Diff #163538)

Add documentation about URISchemes?

clang-tools-extra/clangd/index/dex/Token.cpp
18 ↗(On Diff #163538)

Add FIXME mentioning that the parsing and encoding here are not necessary and could potentially be optimized.

28 ↗(On Diff #163538)

nit: ParsedURI->scheme();

32 ↗(On Diff #163538)

I think you need to set posix style explicitly here. On windows, the separator is '/'.

32 ↗(On Diff #163538)

I think the original path should also be used as a proximity path.

51 ↗(On Diff #163538)

You could use llvm::consumeError(URI.takeError()).

clang-tools-extra/clangd/index/dex/Token.h
45 ↗(On Diff #163538)

nit: For example, .... (URI is an good example I think)

58 ↗(On Diff #163538)

Do we also consider the original URI as a proximity path?

94 ↗(On Diff #163538)

nit: also explain how Limit is used? e.g. closer parent directories are preferred?

96 ↗(On Diff #163538)

As we are assuming that Path is an URI. We should be more explicit in the function names.

104 ↗(On Diff #163538)

I'm wondering whether we should make Count customizable or keep it as implementation detail. generateProximityPaths and generateQueryProximityPaths should have the same Limit/Count in practice, and I'm not sure if users could usefully customize the value.

kbobyrev updated this revision to Diff 163667.Sep 3 2018, 1:41 AM
kbobyrev marked 2 inline comments as done.
kbobyrev edited the summary of this revision. (Show Details)

Rebase on top of master, s/Path/PathURI/g to be more explicit about the token contents and origin.

kbobyrev updated this revision to Diff 163680.Sep 3 2018, 3:26 AM
kbobyrev marked 17 inline comments as done.

Address a round of comments, refactor code.

ioeric added inline comments.Sep 3 2018, 5:32 AM
clang-tools-extra/;
1 ↗(On Diff #163680)

Unintended change?

ioeric added inline comments.Sep 3 2018, 5:32 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
137 ↗(On Diff #163538)

The FIXME is now outdated?

143 ↗(On Diff #163680)

x in std::exp(x) should be negated so that the range is (0, 1]. And BoostFactor should be 1+(0,1].

I'm not sure if dividing by UpCost is correct here (and similarly in Quality.cpp), because you would want lower score for larger UpCost?

clang-tools-extra/clangd/index/dex/Token.cpp
36 ↗(On Diff #163680)

For the original URI, we could simply add it to the result outside of the loop (and --Limit) to save one iteration?

54 ↗(On Diff #163680)

nit: just if (!URI)

clang-tools-extra/clangd/index/dex/Token.h
64 ↗(On Diff #163680)

nit: not necessarily *all* parents right?

65 ↗(On Diff #163680)

Maybe just URI? Or ProximityURI?

96 ↗(On Diff #163680)

nit: s/Path/URIStr/ for the argument.
nit: Just generateProximityURIs? Same below.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
443 ↗(On Diff #163680)

This doesn't seem to work on windows?

620 ↗(On Diff #163680)

It's unclear what this is testing. Intuitively, you would want a smoke test that checks a symbol with better proximity is ranked higher (when all other signals are the same).

653 ↗(On Diff #163680)

again, watch out for windows when hard-coding paths :)

clang-tools-extra/unittests/clangd/TestFS.cpp
67 ↗(On Diff #163680)

Why this change?

kbobyrev updated this revision to Diff 163773.Sep 4 2018, 3:05 AM
kbobyrev marked 12 inline comments as done.
  • Rebase on top of the parent patch
  • Apply many refactorings where appropriate
  • Write more comments and documentation
  • Abstract pieces of code which are shared by multiple pieces of Clangd
ioeric added a comment.Sep 4 2018, 7:51 AM

There are a few changes/refactorings which I would suggest splitting from this patch, as they would require more thoughts and seem unrelated in this patch. Please see the inline comments :)

clang-tools-extra/clangd/Quality.h
130 ↗(On Diff #163773)

This should take in a FileDistanceOptions to allow customized up/down costs. To support general cases, this should probably take UpDistance, DownDistance, and FileDistanceOptions.

130 ↗(On Diff #163773)

Not sure if this is the right name. Two paths don't need to be parent and child (if we also consider down cost). Maybe just proximityScore?

clang-tools-extra/clangd/index/dex/DexIndex.cpp
42 ↗(On Diff #163773)

Maybe ScoredSymbol? Is this used outside of buildIndex? If not, maybe just inline it? Or simply use a std::pair?

59 ↗(On Diff #163773)

We should fix quality to return float for consistency.

59 ↗(On Diff #163773)

nit: Instead of reconstruct the structure, I'd probably set the fields manually.

65 ↗(On Diff #163773)

Is the operator> overload used anywhere besides here? If not, I'd suggest inlining the overload.

129 ↗(On Diff #163773)

IIUC, the assumption here is that generateQueryProximityURIs returns proximity tokens of parent directories sorted by distance. This seems to be making assumption about the implementation. generateQueryProximityURIs should return the distance explicitly along with the tokens.

176 ↗(On Diff #163773)

Just inline the comparator for clarity.

179 ↗(On Diff #163773)

This is adding another staging of filtering. It seems to be an optimization that's not specific to the proximity path change and would probably require more careful review. Can we split this out of this patch for now?

clang-tools-extra/clangd/index/dex/DexIndex.h
80 ↗(On Diff #163773)

nit: Sorted in the descending order of symbol quality.

83 ↗(On Diff #163773)

nit: SymbolQuality[I] is the quality of Symbols[I].

clang-tools-extra/clangd/index/dex/Iterator.h
139 ↗(On Diff #163773)

I'm not sure about this refactoring. It seems to be out of the scope. Can we split it out and still return pair for now?

clang-tools-extra/clangd/index/dex/Token.cpp
36 ↗(On Diff #163680)

nit: (--Limit > 0) for clarity.

clang-tools-extra/clangd/index/dex/Token.h
100 ↗(On Diff #163773)

As mentioned in another comment, this should return distance along with tokens (instead of assuming consecutive parents and distances).

ioeric requested changes to this revision.Sep 4 2018, 7:52 AM

(and forgot to request changes ;)

This revision now requires changes to proceed.Sep 4 2018, 7:52 AM
kbobyrev updated this revision to Diff 163987.Sep 5 2018, 2:30 AM
kbobyrev marked 14 inline comments as done.
  • Rebase on top of master
  • Address suggestions Eric wrote here
  • Address suggestions from the offline discussion by reusing Quality.cpp infrastructure for path proximity boosting calculation
ioeric added inline comments.Sep 5 2018, 3:28 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
55 ↗(On Diff #163987)

This sorts in ascending order by default?

118 ↗(On Diff #163987)

As chatted offline, we only need a single URIDistacne structure for all proximity paths. And we could also deduplicate parent directories.

generateQueryPathProximityTokens should return URIs instead of tokens as the token data doesn't need to be raw URIs (e.g. hash). And the function should probably live in DexIndex.cpp instead of Token.cpp.

I'd also suggest pulling this code into a helper.

159 ↗(On Diff #163987)

I think using IDAndScore = decltype(IDAndScores.fron()) might help here.

59 ↗(On Diff #163773)

We shouldn't need the static_cast<float> anymore?

clang-tools-extra/clangd/tool/ClangdMain.cpp
287 ↗(On Diff #163987)

This could take Opts by reference.

ioeric requested changes to this revision.Sep 5 2018, 4:43 AM
This revision now requires changes to proceed.Sep 5 2018, 4:43 AM
kbobyrev updated this revision to Diff 164023.Sep 5 2018, 6:03 AM
kbobyrev marked 5 inline comments as done.

Address another round of comments.

kbobyrev updated this revision to Diff 164024.Sep 5 2018, 6:06 AM

Keep 2 minor refactorings out of the scope of this patch.

ioeric requested changes to this revision.Sep 5 2018, 7:23 AM

This should be the last round! ;)

clang-tools-extra/clangd/index/dex/DexIndex.cpp
45 ↗(On Diff #164024)

To be clearer, maybe createFileProximityIterators?

50 ↗(On Diff #164024)

nit: just llvm::StringSet<>?

s/UniqueURIs/ParentURIs/? As this is a set, we don't need to be explicit about uniqueness here.

s/FuzzyFindRequest/ProximityPaths (in comment). FuzzyFindRequest shouldn't be relevant in this function, and we should avoid mentioning it here. same below.

51 ↗(On Diff #164024)

nit: maybe // Use ProximityPaths as proximity sources?

55 ↗(On Diff #164024)

I'd double check PathProximityURIs is not empty.

62 ↗(On Diff #164024)

nit: this is symbol *relevance* evaluation

77 ↗(On Diff #164024)

evaluate() should return an appropriate boosting score (<1 for down-boosting and >1 for up-boosting), so there is no need to plus 1 here.

182 ↗(On Diff #164024)

const IDAndScore&?

244 ↗(On Diff #164024)

nit: Scheme and Authority are both used once. I'd just inline them.

246 ↗(On Diff #164024)

nit: I'd call this Body for clarity.

clang-tools-extra/clangd/index/dex/DexIndex.h
98 ↗(On Diff #164024)

nit: s/Path/URIPath/

98 ↗(On Diff #164024)

nit: maybe mention that these functions are exposed for testing only?

104 ↗(On Diff #164024)

How about changing this one to just convert AbsolutePath to a URI of the preferred scheme and having the users call generateProximityURIs? This would simply the contract. And it can probably be a library function (e.g. makePreferredURI(AbsPath, URISchemes)) in URI.h.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
430 ↗(On Diff #164024)

Could you add a case where generateProximityURIs generates fewer than 5 proximity URIs?

601 ↗(On Diff #164024)

Could you add a helper (e.g. lambda) for creating the symbol? The code looks almost the same.

Alternatively, this can be:

auto Sym1 = symbol("na::X");
Sym1.CanonicalDeclaration.FileURI = "unittest:///...";
auto Sym2 = symbol("nb::X");
Sym1.CanonicalDeclaration.FileURI = "unittest:///..."

You can simply spell out the URI paths because they are platform agnostic.

623 ↗(On Diff #164024)

We could use the constructor that doesn't take ownership e.g. DexIndex({Sym1, Sym2}, URISchemes)

627 ↗(On Diff #164024)

nit: or // The best candidate can change depending on the proximity paths.?

636 ↗(On Diff #164024)

This is not going to work on windows. You would need something like:

append(RequestPath, "a")
append(RequestPath, "b")

And you probably don't need a very deep path.

638 ↗(On Diff #164024)

I think matchDeclURIs is not necessary as we could use different qualified names?

clang-tools-extra/unittests/clangd/TestIndex.h
45 ↗(On Diff #164024)

(and this shouldn't be necessary)

This revision now requires changes to proceed.Sep 5 2018, 7:23 AM
kbobyrev updated this revision to Diff 164062.Sep 5 2018, 10:13 AM
kbobyrev marked 18 inline comments as done.

Address a round of comments

kbobyrev added inline comments.Sep 5 2018, 12:17 PM
clang-tools-extra/unittests/clangd/DexIndexTests.cpp
623 ↗(On Diff #164024)

That would try to convert {Sym1, Sym2} to SymbolSlab and that's not possible. IIUC your suggestion is about using the constructor with ranges:

template <typename Range>
DexIndex(Range &&Symbols, llvm::ArrayRef<std::string> URISchemes)

I could use llvm::make_range(begin(Symbols), end(Symbols)), but then I'd have to put everything into Symbols container first and that might be the same amount of code. I'm not sure that would be better than having a Builder. Did I miss something?

ioeric accepted this revision.Sep 5 2018, 12:39 PM
ioeric added inline comments.
clang-tools-extra/clangd/URI.cpp
200 ↗(On Diff #164062)

maybe also include the schemes in the error message. with llvm::join?

clang-tools-extra/clangd/URI.h
48 ↗(On Diff #164062)

nit: // Similar to above except this uses the first scheme in \p Schemes that works.

clang-tools-extra/clangd/index/dex/DexIndex.cpp
51 ↗(On Diff #164062)

this comment is no longer true.

52 ↗(On Diff #164062)

These are not URIs anymore. Consider calling it Sources?

55 ↗(On Diff #164062)

Result of URI::create must be explicitly checked; it would trigger assertion if it is not checked or contains error.

58 ↗(On Diff #164062)

We should still add Path to RequestURIs even if there is no proximity URI.

65 ↗(On Diff #164062)

nit: s/request URI/ProximityPaths/

242 ↗(On Diff #164062)

I'd suggesting avoiding assertion here. Consider elog the error and returning empty result.

clang-tools-extra/clangd/index/dex/Iterator.cpp
395 ↗(On Diff #164062)

Why do we need this change?

clang-tools-extra/clangd/tool/ClangdMain.cpp
278 ↗(On Diff #164062)

(Have you rebased on HEAD? The code has been changed.)

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
623 ↗(On Diff #164024)

Have you tried std::vector<Symbol>{Sym1, Sym2}? or push_back twice? Using Builder seems unnecessary but not too bad.

This revision is now accepted and ready to land.Sep 5 2018, 12:39 PM
kbobyrev updated this revision to Diff 164172.Sep 6 2018, 2:19 AM
kbobyrev marked 10 inline comments as done.

This should address the last round of comments.

There was still some hassle with the SymbolSlab::Builder in the unit tests and I decided to leave it as it is due to some weird ownership semantics introduced by vector of Symbols.

kbobyrev updated this revision to Diff 164176.Sep 6 2018, 2:55 AM
kbobyrev marked 3 inline comments as done.

Resolve the issue with SymbolSlab::Builder and make sure passed vector has correct lifetime.

This revision was automatically updated to reflect the committed changes.