This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Incorporate transitive #includes into code complete proximity scoring.
ClosedPublic

Authored by sammccall on Jun 21 2018, 9:26 AM.

Details

Summary

We now compute a distance from the main file to the symbol header, which
is a weighted count of:

  • some number of #include traversals from source file --> included file
  • some number of FS traversals from file --> parent directory
  • some number of FS traversals from parent directory --> child file/dir

This calculation is performed in the appropriate URI scheme.

This means we'll get some proximity boost from header files in main-file
contexts, even when these are in different directory trees.

This extended file proximity model is not yet incorporated in the index
interface/implementation.

Reviewer guide (decreasing order of complexity):

  • FileDistance is new, and contains the algorithm/structures for computing the file distances efficiently from a set of weighted roots
  • Headers has significant changes: the IncludeStructure class expands the scope of include info we previously captured
  • Quality uses these to implement the new definition of SemaProximityScore
  • CodeComplete has a bit of churn because the lifetimes there are so complicated
  • the rest is boilerplate

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision.Jun 21 2018, 9:26 AM
sammccall updated this revision to Diff 152326.Jun 21 2018, 9:35 AM

Document some more stuff, and clang-format

sammccall edited the summary of this revision. (Show Details)Jun 21 2018, 9:39 AM

Looks great! Thanks for doing this!

The algorithm looks pretty efficient. It might still make sense to collect some performance numbers for real files with potentially large #include tree (we have plenty of those in our internal codebase :)

clangd/CodeComplete.cpp
908 ↗(On Diff #152326)

It might make sense to briefly explain somewhere how computational cost is distributed across the workflow.

938 ↗(On Diff #152326)

nit: Style?

clangd/FileDistance.cpp
72 ↗(On Diff #152326)

nit: this loop would probably be easier to understand if iterated reversely.

86 ↗(On Diff #152326)

This is done for every index symbol, so I wonder if we could avoid parsing URIs by assuming some URI structures.

clangd/FileDistance.h
56 ↗(On Diff #152326)

nit: It's unclear what the values of Roots are (it can be confused with tree roots or directory roots).

clangd/Headers.cpp
28 ↗(On Diff #152326)

What does this do?

clangd/Headers.h
62 ↗(On Diff #152326)

Does this return all transitive includes in the entire TU or just those from MainFileName? And what's MainFileName here? Is it the same as the main file in a TU?

67 ↗(On Diff #152326)

The real name can be empty. How do we handle empty path?

75 ↗(On Diff #152326)

Maybe fileIndex?

clangd/Quality.cpp
233 ↗(On Diff #152326)

nit: static

unittests/clangd/FileDistanceTests.cpp
36 ↗(On Diff #152326)

I think this is an interesting case. IIUC, 22u is contributed by 4 ups (4*5) and 1 include (1*2, via llvm/ADT/StringRef.h).

Suppose a/b/c/d/e/f.cc includes base/x.h, then the shortest path into directory other/ is likely to go via base/x.h. This might become a problem if base/x.h is very commonly used. One solution is probably to penalize edits that hit the root.

sammccall updated this revision to Diff 152479.Jun 22 2018, 7:54 AM
sammccall marked 9 inline comments as done.

Address review comments.

sammccall added inline comments.Jun 27 2018, 2:39 AM
clangd/FileDistance.cpp
35 ↗(On Diff #152326)

Note there was a nasty bug in FileDistance::FileDistance that failed to consider down edges. Fixed and test case added.

86 ↗(On Diff #152326)

(note we cache the URI string, so it's at most once per file)

We could, at the cost of potentially subtle bugs/dependencies. (what if the data doesn't make exactly the same choices regarding escaping?)

parse() is actually *fairly* cheap. Unlike resolve() or create(), it doesn't touch the URI scheme plugins at all, so nothing's virtual. It's a few calls to percentDecode() and a few string allocations.
I think there's room for optimization there (scheme should be a smallstring for one) but wouldn't go hacking around it unless it turns up on a profile.

clangd/FileDistance.h
56 ↗(On Diff #152326)

Changed roots -> sources everywhere, which is less ambiguous.

clangd/Headers.h
62 ↗(On Diff #152326)

It's transitive... there was actually a comment but it was in the wrong place!
Wrote a better comment and generalized slightly to make this seem less magic. You can actually ask for the files reachable starting at any file (MainFile is usually the one you want though).

67 ↗(On Diff #152326)

We attempt to record it every time we see it.
Fixed includeDepth() to exclude files from the output if we never get a real name for them.
This shouldn't be the case in practice - real name is missing for FileEntry when using a preamble, but we should see their names when *building* the preamble.

unittests/clangd/FileDistanceTests.cpp
36 ↗(On Diff #152326)

Yeah, this could be a weakness. I don't think this is specific to #includes, or to root directories.
I think the generalization is that there exist directories (like /usr/include) whose siblings are unrelated. Down-traversals in these directories should be more expensive. I've added a Caveats section to the FileDistance documentation, but I would like to avoid adding special cases until we can see how much difference they make.

ioeric added inline comments.Jun 28 2018, 4:53 AM
clangd/FileDistance.cpp
9 ↗(On Diff #152479)

Is this intentionally reserved?

51 ↗(On Diff #152479)

nit: this probably deserve a comment or a better formatting. It looks like part of the loop incremental.

86 ↗(On Diff #152479)

Should we also check the cache here?

unittests/clangd/FileDistanceTests.cpp
36 ↗(On Diff #152326)

As chatted offline, this could be still concerning as it's not uncommon for files to include headers from top-level directories. I think the idea you proposed - restricting up traversals from included files, could address the problem.

sammccall updated this revision to Diff 153504.Jun 29 2018, 9:00 AM
sammccall marked 2 inline comments as done.

Address review comments. Most notably: limit the up-traversals from some sources.

ioeric accepted this revision.Jul 2 2018, 2:14 AM

lgtm.

Again, we might still want to measure the performance impact on files with potentially large #include tree ;)

clangd/FileDistance.cpp
26 ↗(On Diff #153504)

Shouldn't /foo a down from / and 2+1 = 3?

This revision is now accepted and ready to land.Jul 2 2018, 2:14 AM
sammccall marked an inline comment as done.Jul 3 2018, 12:31 AM

Sigh... sorry, forgot to send these comments with the last patch. Will measure performance today.

clangd/FileDistance.cpp
9 ↗(On Diff #152479)

Oops, added implementation docs

51 ↗(On Diff #152479)

Moved out of the loop, now the loop has a counter. Hopefully less confusing?

86 ↗(On Diff #152479)

I'm not sure why - for the full path we'll look up the cached value in the first iteration through the loop. Ancestors will be empty, so this function should be pretty cheap for this happy path - dominated by computing the hash.

or do you mean to skip canonicalization?
This doesn't matter in practice as we'll use this through URIDistance which does a pre-canonicalization cache. If we used this directly then it might be worthwhile.

Performance on a transitively big file.
Stress test - global scope completion with index turned off.
Benchmarking codeComplete() call 100x per trial, trials randomly ordered.

Before this patch (427ms 421ms 430ms)
After this patch (418ms 420ms 417ms)
So it seems this is slightly faster, maybe 2%.

This revision was automatically updated to reflect the committed changes.

Hi Carlos, thanks for letting us know! I sent r336249 to fix the windows
test.

Hi Carlos, thanks for letting us know! I sent r336249 to fix the windows
test.

Hi @ioeric,

Thanks very much. Happy to help.