Page MenuHomePhabricator

[clangd] Implement proximity path boosting for Dex

Authored by kbobyrev on Aug 30 2018, 3:07 AM.



This patch introduces PathURI Search Token kind and utilizes it to uprank symbols which are defined in files with small distance to the directory where the fuzzy find request is coming from (e.g. files user is editing).

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ioeric added inline comments.Sep 3 2018, 5:32 AM
143 ↗(On Diff #163680)

x in std::exp(x) should be negated so that the range is (0, 1]. And BoostFactor should be 1+(0,1].

I'm not sure if dividing by UpCost is correct here (and similarly in Quality.cpp), because you would want lower score for larger UpCost?

137 ↗(On Diff #163538)

The FIXME is now outdated?

36 ↗(On Diff #163680)

For the original URI, we could simply add it to the result outside of the loop (and --Limit) to save one iteration?

54 ↗(On Diff #163680)

nit: just if (!URI)

64 ↗(On Diff #163680)

nit: not necessarily *all* parents right?

65 ↗(On Diff #163680)

Maybe just URI? Or ProximityURI?

96 ↗(On Diff #163680)

nit: s/Path/URIStr/ for the argument.
nit: Just generateProximityURIs? Same below.

443 ↗(On Diff #163680)

This doesn't seem to work on windows?

620 ↗(On Diff #163680)

It's unclear what this is testing. Intuitively, you would want a smoke test that checks a symbol with better proximity is ranked higher (when all other signals are the same).

653 ↗(On Diff #163680)

again, watch out for windows when hard-coding paths :)

67 ↗(On Diff #163680)

Why this change?

kbobyrev updated this revision to Diff 163773.Sep 4 2018, 3:05 AM
kbobyrev marked 12 inline comments as done.
  • Rebase on top of the parent patch
  • Apply many refactorings where appropriate
  • Write more comments and documentation
  • Abstract pieces of code which are shared by multiple pieces of Clangd
ioeric added a comment.Sep 4 2018, 7:51 AM

There are a few changes/refactorings which I would suggest splitting from this patch, as they would require more thoughts and seem unrelated in this patch. Please see the inline comments :)

130 ↗(On Diff #163773)

This should take in a FileDistanceOptions to allow customized up/down costs. To support general cases, this should probably take UpDistance, DownDistance, and FileDistanceOptions.

130 ↗(On Diff #163773)

Not sure if this is the right name. Two paths don't need to be parent and child (if we also consider down cost). Maybe just proximityScore?

42 ↗(On Diff #163773)

Maybe ScoredSymbol? Is this used outside of buildIndex? If not, maybe just inline it? Or simply use a std::pair?

59 ↗(On Diff #163773)

We should fix quality to return float for consistency.

59 ↗(On Diff #163773)

nit: Instead of reconstruct the structure, I'd probably set the fields manually.

65 ↗(On Diff #163773)

Is the operator> overload used anywhere besides here? If not, I'd suggest inlining the overload.

129 ↗(On Diff #163773)

IIUC, the assumption here is that generateQueryProximityURIs returns proximity tokens of parent directories sorted by distance. This seems to be making assumption about the implementation. generateQueryProximityURIs should return the distance explicitly along with the tokens.

176 ↗(On Diff #163773)

Just inline the comparator for clarity.

179 ↗(On Diff #163773)

This is adding another staging of filtering. It seems to be an optimization that's not specific to the proximity path change and would probably require more careful review. Can we split this out of this patch for now?

80 ↗(On Diff #163773)

nit: Sorted in the descending order of symbol quality.

83 ↗(On Diff #163773)

nit: SymbolQuality[I] is the quality of Symbols[I].

139 ↗(On Diff #163773)

I'm not sure about this refactoring. It seems to be out of the scope. Can we split it out and still return pair for now?

36 ↗(On Diff #163680)

nit: (--Limit > 0) for clarity.

100 ↗(On Diff #163773)

As mentioned in another comment, this should return distance along with tokens (instead of assuming consecutive parents and distances).

ioeric requested changes to this revision.Sep 4 2018, 7:52 AM

(and forgot to request changes ;)

This revision now requires changes to proceed.Sep 4 2018, 7:52 AM
kbobyrev updated this revision to Diff 163987.Sep 5 2018, 2:30 AM
kbobyrev marked 14 inline comments as done.
  • Rebase on top of master
  • Address suggestions Eric wrote here
  • Address suggestions from the offline discussion by reusing Quality.cpp infrastructure for path proximity boosting calculation
ioeric added inline comments.Sep 5 2018, 3:28 AM
59 ↗(On Diff #163773)

We shouldn't need the static_cast<float> anymore?

55 ↗(On Diff #163987)

This sorts in ascending order by default?

118 ↗(On Diff #163987)

As chatted offline, we only need a single URIDistacne structure for all proximity paths. And we could also deduplicate parent directories.

generateQueryPathProximityTokens should return URIs instead of tokens as the token data doesn't need to be raw URIs (e.g. hash). And the function should probably live in DexIndex.cpp instead of Token.cpp.

I'd also suggest pulling this code into a helper.

159 ↗(On Diff #163987)

I think using IDAndScore = decltype(IDAndScores.fron()) might help here.

287 ↗(On Diff #163987)

This could take Opts by reference.

ioeric requested changes to this revision.Sep 5 2018, 4:43 AM
This revision now requires changes to proceed.Sep 5 2018, 4:43 AM
kbobyrev updated this revision to Diff 164023.Sep 5 2018, 6:03 AM
kbobyrev marked 5 inline comments as done.

Address another round of comments.

kbobyrev updated this revision to Diff 164024.Sep 5 2018, 6:06 AM

Keep 2 minor refactorings out of the scope of this patch.

ioeric requested changes to this revision.Sep 5 2018, 7:23 AM

This should be the last round! ;)

45 ↗(On Diff #164024)

To be clearer, maybe createFileProximityIterators?

50 ↗(On Diff #164024)

nit: just llvm::StringSet<>?

s/UniqueURIs/ParentURIs/? As this is a set, we don't need to be explicit about uniqueness here.

s/FuzzyFindRequest/ProximityPaths (in comment). FuzzyFindRequest shouldn't be relevant in this function, and we should avoid mentioning it here. same below.

51 ↗(On Diff #164024)

nit: maybe // Use ProximityPaths as proximity sources?

55 ↗(On Diff #164024)

I'd double check PathProximityURIs is not empty.

62 ↗(On Diff #164024)

nit: this is symbol *relevance* evaluation

77 ↗(On Diff #164024)

evaluate() should return an appropriate boosting score (<1 for down-boosting and >1 for up-boosting), so there is no need to plus 1 here.

182 ↗(On Diff #164024)

const IDAndScore&?

244 ↗(On Diff #164024)

nit: Scheme and Authority are both used once. I'd just inline them.

246 ↗(On Diff #164024)

nit: I'd call this Body for clarity.

98 ↗(On Diff #164024)

nit: s/Path/URIPath/

98 ↗(On Diff #164024)

nit: maybe mention that these functions are exposed for testing only?

104 ↗(On Diff #164024)

How about changing this one to just convert AbsolutePath to a URI of the preferred scheme and having the users call generateProximityURIs? This would simply the contract. And it can probably be a library function (e.g. makePreferredURI(AbsPath, URISchemes)) in URI.h.

430 ↗(On Diff #164024)

Could you add a case where generateProximityURIs generates fewer than 5 proximity URIs?

601 ↗(On Diff #164024)

Could you add a helper (e.g. lambda) for creating the symbol? The code looks almost the same.

Alternatively, this can be:

auto Sym1 = symbol("na::X");
Sym1.CanonicalDeclaration.FileURI = "unittest:///...";
auto Sym2 = symbol("nb::X");
Sym1.CanonicalDeclaration.FileURI = "unittest:///..."

You can simply spell out the URI paths because they are platform agnostic.

623 ↗(On Diff #164024)

We could use the constructor that doesn't take ownership e.g. DexIndex({Sym1, Sym2}, URISchemes)

627 ↗(On Diff #164024)

nit: or // The best candidate can change depending on the proximity paths.?

636 ↗(On Diff #164024)

This is not going to work on windows. You would need something like:

append(RequestPath, "a")
append(RequestPath, "b")

And you probably don't need a very deep path.

638 ↗(On Diff #164024)

I think matchDeclURIs is not necessary as we could use different qualified names?

45 ↗(On Diff #164024)

(and this shouldn't be necessary)

This revision now requires changes to proceed.Sep 5 2018, 7:23 AM
kbobyrev updated this revision to Diff 164062.Sep 5 2018, 10:13 AM
kbobyrev marked 18 inline comments as done.

Address a round of comments

kbobyrev added inline comments.Sep 5 2018, 12:17 PM
623 ↗(On Diff #164024)

That would try to convert {Sym1, Sym2} to SymbolSlab and that's not possible. IIUC your suggestion is about using the constructor with ranges:

template <typename Range>
DexIndex(Range &&Symbols, llvm::ArrayRef<std::string> URISchemes)

I could use llvm::make_range(begin(Symbols), end(Symbols)), but then I'd have to put everything into Symbols container first and that might be the same amount of code. I'm not sure that would be better than having a Builder. Did I miss something?

ioeric accepted this revision.Sep 5 2018, 12:39 PM
ioeric added inline comments.
200 ↗(On Diff #164062)

maybe also include the schemes in the error message. with llvm::join?

48 ↗(On Diff #164062)

nit: // Similar to above except this uses the first scheme in \p Schemes that works.

51 ↗(On Diff #164062)

this comment is no longer true.

52 ↗(On Diff #164062)

These are not URIs anymore. Consider calling it Sources?

55 ↗(On Diff #164062)

Result of URI::create must be explicitly checked; it would trigger assertion if it is not checked or contains error.

58 ↗(On Diff #164062)

We should still add Path to RequestURIs even if there is no proximity URI.

65 ↗(On Diff #164062)

nit: s/request URI/ProximityPaths/

242 ↗(On Diff #164062)

I'd suggesting avoiding assertion here. Consider elog the error and returning empty result.

395 ↗(On Diff #164062)

Why do we need this change?

278 ↗(On Diff #164062)

(Have you rebased on HEAD? The code has been changed.)

623 ↗(On Diff #164024)

Have you tried std::vector<Symbol>{Sym1, Sym2}? or push_back twice? Using Builder seems unnecessary but not too bad.

This revision is now accepted and ready to land.Sep 5 2018, 12:39 PM
kbobyrev updated this revision to Diff 164172.Sep 6 2018, 2:19 AM
kbobyrev marked 10 inline comments as done.

This should address the last round of comments.

There was still some hassle with the SymbolSlab::Builder in the unit tests and I decided to leave it as it is due to some weird ownership semantics introduced by vector of Symbols.

kbobyrev updated this revision to Diff 164176.Sep 6 2018, 2:55 AM
kbobyrev marked 3 inline comments as done.

Resolve the issue with SymbolSlab::Builder and make sure passed vector has correct lifetime.

This revision was automatically updated to reflect the committed changes.