This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Filter out private proto symbols in SymbolCollector.
ClosedPublic

Authored by ioeric on May 11 2018, 5:46 AM.

Details

Summary

This uses heuristics to identify private proto symbols. For example,
top-level symbols whose name contains "_" are considered private. These symbols
are not expected to be used by users.

Diff Detail

Repository
rL LLVM

Event Timeline

ioeric created this revision.May 11 2018, 5:46 AM
ilya-biryukov added inline comments.May 11 2018, 6:24 AM
clangd/index/SymbolCollector.cpp
95 ↗(On Diff #146313)

NIT: reduce nesting by inverting if condition (see LLVM style guide)

155 ↗(On Diff #146313)

Maybe add a comment mentioning that we remove internal symbols from protobuf code generator here?
I can easily decipher what happens here, because I know of protos, but I can imagine the comment might be very helpful to someone reading the code and being unfamiliar with protos.

156 ↗(On Diff #146313)

We should also run the same code to filter our those decls coming from sema completions.
Otherwise, they might still pop up in completion results if internal proto symbols were deserialized from preamble by clang before running the code completion.

unittests/clangd/SymbolCollectorTests.cpp
713 ↗(On Diff #146313)

Maybe also test that the same symbol is not excluded in the file that does not end with .proto.h?

Can there be an option for this? This seems very library specific and could break other code bases. Ideally, there would be a generic mechanism for this kind of filtering, i.e. specify a pattern of excluded files or symbol names. But I understand this would be cumbersome because you want to filter only *some* symbol names in *some* files, so it would be difficult for users to specify this intersection of conditions on command-line arguments, for example. I think this needs to be discussed a bit more or have this turned off by default (with an option to turn on!) until there is a more general solution for this kind of filtering.

Can there be an option for this? This seems very library specific and could break other code bases. Ideally, there would be a generic mechanism for this kind of filtering, i.e. specify a pattern of excluded files or symbol names. But I understand this would be cumbersome because you want to filter only *some* symbol names in *some* files, so it would be difficult for users to specify this intersection of conditions on command-line arguments, for example. I think this needs to be discussed a bit more or have this turned off by default (with an option to turn on!) until there is a more general solution for this kind of filtering.

Having user-configurable filtering may certainly be useful, but requires some design to get right.
And even if we have it, I think there's value in automatically handling popular frameworks, unless we know it might break other code in practice.

E.g., for protobuf, we know that generated headers always end with .proto.h. We could also check for comments that proto compiler tends to put into the generated files to be sure. If we do that, I feel it's better to have the filtering for protos enabled by default, since there's almost zero chance that people had a file that ends with .proto.h and put a proto compiler comment into it.
But even the .proto.h ending seems like a good enough indication.

ioeric removed a subscriber: malaperle.

So, the first line of the file generated by proto compiler seems to be something like this:

// Generated by the protocol buffer compiler.  DO NOT EDIT!

If we check the symbol comes from a file with this comment, there will be zero chance that we guess it wrong.
And we can always filter, in addition to the user-provided filters that @malaperle proposed (which also sound like a very useful feature to me!)

@ioeric, @malaperle, @sammccall, WDYT?

Can there be an option for this? This seems very library specific and could break other code bases. Ideally, there would be a generic mechanism for this kind of filtering, i.e. specify a pattern of excluded files or symbol names. But I understand this would be cumbersome because you want to filter only *some* symbol names in *some* files, so it would be difficult for users to specify this intersection of conditions on command-line arguments, for example. I think this needs to be discussed a bit more or have this turned off by default (with an option to turn on!) until there is a more general solution for this kind of filtering.

Having user-configurable filtering may certainly be useful, but requires some design to get right.
And even if we have it, I think there's value in automatically handling popular frameworks, unless we know it might break other code in practice.

Here :)
http://www.sidefx.com/docs/hdk/_s_o_p___bone_capture_lines_8proto_8h_source.html

I feel it's better to have the filtering for protos enabled by default

I like the idea that things work "out of the box", we have to make sure that it doesn't make it buggy for certain code bases and impossible to work around.

So, the first line of the file generated by proto compiler seems to be something like this:

// Generated by the protocol buffer compiler.  DO NOT EDIT!

If we check the symbol comes from a file with this comment, there will be zero chance that we guess it wrong.
And we can always filter, in addition to the user-provided filters that @malaperle proposed (which also sound like a very useful feature to me!)

@ioeric, @malaperle, @sammccall, WDYT?

I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)

If handling for other libraries is added later it would be good to split out this code a bit. A collection of "filters" could be passed to symbol collector. Each filter/framework-handling could be in it's own source file. Later...

ioeric updated this revision to Diff 146338.May 11 2018, 9:07 AM
ioeric marked 3 inline comments as done.
  • Addressed a few comments.

Thanks for sharing the example Marc! It's a bit surprising to see files that are not protobuf-generated named proto.h.

I'm not a big fan of parsing file comment in proto. It seems a bit cumbersome and we might not be able (or too expensive) to do so for completion results from sema (if we do want to filter at completion time).

Pattern-based filtering could be an option as it wouldn't require code modification and could support potentially more filters, although I'm a bit worries about rules getting too complicated (e.g. filters on symbol kinds etc) or running into limitation.

But for now, it seems to me that the easiest approach is putting an option around proto heuristic for the symbol collector, until we need more filters.

clangd/index/SymbolCollector.cpp
156 ↗(On Diff #146313)

Just want to clarify before going further. IIUC, in *index-based* completion, the preamble could still contain symbols from headers such that sema completion could still give you symbols from headers.

If we do need to build the filter into code completion, we would need more careful design as code completion code path is more latency sensitive.

unittests/clangd/SymbolCollectorTests.cpp
713 ↗(On Diff #146313)

Sounds good.

Here :)
http://www.sidefx.com/docs/hdk/_s_o_p___bone_capture_lines_8proto_8h_source.html

Didn't take along to find an example. Thanks for digging this up. That looks like a good enough reason to not apply proto-specific filtering based solely on the filename...

I like the idea that things work "out of the box", we have to make sure that it doesn't make it buggy for certain code bases and impossible to work around.

Totally agree, we should only enable something that we all agree is good enough at detecting the frameworks that it never guesses wrong on real code.

If handling for other libraries is added later it would be good to split out this code a bit. A collection of "filters" could be passed to symbol collector. Each filter/framework-handling could be in it's own source file. Later...

That LG, I guess we can iterate on the design in a CL, design doc or an email thread. However, it's outside the scope of this patch probably.

I'm not a big fan of parsing file comment in proto. It seems a bit cumbersome and we might not be able (or too expensive) to do so for completion results from sema (if we do want to filter at completion time).

Why do you feel it's cubersome? Getting the first line of the file from SourceManager and looking at it seems easy.
I certainly don't see why this should be expensive if we do it at the right time (perhaps it means doing that when building the preamble and stashing the results alongside it, but that's also easy in our current setup).

clangd/index/SymbolCollector.cpp
156 ↗(On Diff #146313)

For reference. As discussed outside this thread, we might have decls from headers in the sema completion items.
It does not seem too hard to add filtering for those as well: essentially, we just need to call the same function at code completion time.

ioeric updated this revision to Diff 146770.May 15 2018, 2:55 AM
  • Add heuristic to reduce false position on identifying proto headers

I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)

@malaperle Having an option for filtering heuristics seems a bit confusing. We have other filters in the symbol collector that we think could improve user experience, and we don't provide options for those. Similarly, for proto headers, I think we could also get away without such an option if we strike for low/no false positive (e.g. correctly identify proto headers). If folks run into problems with the filter, we would like to understand the use cases and improve the filters. In general, we think the proto filter, when it works, would improve user experience.

I'm not a big fan of parsing file comment in proto. It seems a bit cumbersome and we might not be able (or too expensive) to do so for completion results from sema (if we do want to filter at completion time).

Why do you feel it's cubersome? Getting the first line of the file from SourceManager and looking at it seems easy.
I certainly don't see why this should be expensive if we do it at the right time (perhaps it means doing that when building the preamble and stashing the results alongside it, but that's also easy in our current setup).

@ilya-biryukov You are right, getting a working solution seems easy enough. I was more concerned about finding a design that doesn't intrude completion workflow with library specific logic. But this makes more sense now that we are not doing the filtering on code completion workflow.

clangd/index/SymbolCollector.cpp
156 ↗(On Diff #146313)

After further discussion, we agreed that filtering on code completion results may not be the right approach. For example, it would break assumptions in code completion workflow e.g. limit of completion results from index. Alternatively, we could push filtering to the symbol source level. Currently, we have two sources of symbols: sema and index, and both of them can gather symbols from headers, which are then merged in code completion. With this setup, we would need to apply filters on both sources if we want to do any filtering on header symbols. One solution (for index-based completion) is to make sema only collect symbols in main file (just for code completion) and make indexer index headers (current behavior), where we would only need to filter on index. This doesn't address problem for sema-only code completion, but we think it's not a priority to strike for feature parity between sema-based completion and index-based completion, which we don't really have at this point.

So to proceed:

  1. I'll go ahead with the current approach (filter index only) with a stricter check for proto headers.
  2. Make sema only collect symbols in main files.
  3. Potentially also apply the filter on sema completion results.

@malaperle to expand on what Eric said, adding proto hacks with false positives and no way to turn them off is indeed not the way to go here!
There's probably going to be other places we want to filter symbols too, and it should probably be extensible/customizable in some way.
We don't yet have enough examples to know what the structure should be (something regex based, a code-plugin system based on Registry, or something in between), thus the simplest/least invasive option for now (it's important for our internal rollout to have *some* mitigation in place).

@ioeric can you add a comment near the proto-filtering stuff indicating we should work out how to make this extensible?

I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)

We have other filters in the symbol collector that we think could improve user experience, and we don't provide options for those.

What others filters do you mean? If you mean skipping "members", symbols in main files, etc, I a working on making them not skipped, see D44954.

I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)

We have other filters in the symbol collector that we think could improve user experience, and we don't provide options for those.

What others filters do you mean? If you mean skipping "members", symbols in main files, etc, I a working on making them not skipped, see D44954.

I meant the filters in https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/index/SymbolCollector.cpp#L93 e.g. filtering symbols in anonymous namespace, which we think should never appear in the index.

I think members are more interesting than the private proto symbols. We want to un-filter members because there are features that would use them, so indexing them makes sense. But private proto symbols should never be shown to users (e.g. in code completion or workspaceSymbols).

I also think adding an option for indexing members would actually make more sense because they might significantly increase the index size, and it would be good to have options to disable it if users don't use members (e.g. including members can increase size of our internal global index service, and we are not sure if we are ready for that).

@malaperle to expand on what Eric said, adding proto hacks with false positives and no way to turn them off is indeed not the way to go here!
There's probably going to be other places we want to filter symbols too, and it should probably be extensible/customizable in some way.
We don't yet have enough examples to know what the structure should be (something regex based, a code-plugin system based on Registry, or something in between), thus the simplest/least invasive option for now (it's important for our internal rollout to have *some* mitigation in place).
@ioeric can you add a comment near the proto-filtering stuff indicating we should work out how to make this extensible?

I agree with all of that. What I don't quite understand is why a flag is not ok? Just a fail-safe switch in the mean time? You can even leave it on by default so your internal service is not affected. We know for a fact that some code bases like Houdini won't work with this, at least there will be an option to make it work.

I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)

We have other filters in the symbol collector that we think could improve user experience, and we don't provide options for those.

What others filters do you mean? If you mean skipping "members", symbols in main files, etc, I a working on making them not skipped, see D44954.

I meant the filters in https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/index/SymbolCollector.cpp#L93 e.g. filtering symbols in anonymous namespace, which we think should never appear in the index.

I'll be looking at adding them too. For workspaceSymbols it's useful to be able to find them and matches what we had before. But completion will not pull them from the index.

I think members are more interesting than the private proto symbols. We want to un-filter members because there are features that would use them, so indexing them makes sense. But private proto symbols should never be shown to users (e.g. in code completion or workspaceSymbols).

I also think adding an option for indexing members would actually make more sense because they might significantly increase the index size, and it would be good to have options to disable it if users don't use members (e.g. including members can increase size of our internal global index service, and we are not sure if we are ready for that).

It sounds like we'll need both flags. We should discuss that because I'm planning to add even more (almost all?) symbols. I don't think it's common that users won't want members for workspaceSymbols though, but I see how this is not good for the internal indexing service.

@malaperle to expand on what Eric said, adding proto hacks with false positives and no way to turn them off is indeed not the way to go here!
There's probably going to be other places we want to filter symbols too, and it should probably be extensible/customizable in some way.
We don't yet have enough examples to know what the structure should be (something regex based, a code-plugin system based on Registry, or something in between), thus the simplest/least invasive option for now (it's important for our internal rollout to have *some* mitigation in place).
@ioeric can you add a comment near the proto-filtering stuff indicating we should work out how to make this extensible?

I agree with all of that. What I don't quite understand is why a flag is not ok? Just a fail-safe switch in the mean time? You can even leave it on by default so your internal service is not affected.

I think a flag doesn't solve much of the problem, and adds new ones:

  • users have to find the flag, and work out how to turn it on in their editor, and (other than embedders) few will bother. And each flag hurts the usability of all the other flags.
  • if this heuristic is usable only sometimes, that's at codebase granularity, not user granularity. Flags don't work that way. (Static index currently has this problem...)
  • these flags end up in config files, so if we later remove the flag we'll *completely* break clangd for such users

We know for a fact that some code bases like Houdini won't work with this, at least there will be an option to make it work.

Is this still the case after the last revision (with the comment check?)
Agree we should only hardwire this on if we are confident that false positives are vanishingly small.

clangd/index/SymbolCollector.cpp
101 ↗(On Diff #146770)

We're going to end up calling this code on every decl/def we see.
Am I being paranoid by thinking we should check whether the file is a proto once, rather than doing a bunch of string matching every time?

112 ↗(On Diff #146770)

this asserts if the name is not a simple identifier (Maybe operators or something will trigger this?).

@malaperle to expand on what Eric said, adding proto hacks with false positives and no way to turn them off is indeed not the way to go here!
There's probably going to be other places we want to filter symbols too, and it should probably be extensible/customizable in some way.
We don't yet have enough examples to know what the structure should be (something regex based, a code-plugin system based on Registry, or something in between), thus the simplest/least invasive option for now (it's important for our internal rollout to have *some* mitigation in place).
@ioeric can you add a comment near the proto-filtering stuff indicating we should work out how to make this extensible?

I agree with all of that. What I don't quite understand is why a flag is not ok? Just a fail-safe switch in the mean time? You can even leave it on by default so your internal service is not affected.

I think a flag doesn't solve much of the problem, and adds new ones:

  • users have to find the flag, and work out how to turn it on in their editor, and (other than embedders) few will bother. And each flag hurts the usability of all the other flags.
  • if this heuristic is usable only sometimes, that's at codebase granularity, not user granularity. Flags don't work that way. (Static index currently has this problem...)
  • these flags end up in config files, so if we later remove the flag we'll *completely* break clangd for such users

I don't really agree with those points, but...

We know for a fact that some code bases like Houdini won't work with this, at least there will be an option to make it work.

Is this still the case after the last revision (with the comment check?)
Agree we should only hardwire this on if we are confident that false positives are vanishingly small.

...I hadn't noticed the latest version. I think it's safe enough in the new version that we don't need to discuss this much further until it becomes a bigger problem (more libraries, etc).

ioeric updated this revision to Diff 147032.May 16 2018, 3:15 AM
ioeric marked an inline comment as done.
  • Address review comments.
clangd/index/SymbolCollector.cpp
101 ↗(On Diff #146770)

s/Symbol/Decl/

We could store a cache in the symbol collector (just need to add another state in the class, remember to invalidate for a new ASTContext, make this a member etc), but I think the matching is cheap enough?

112 ↗(On Diff #146770)

Good catch!

...I hadn't noticed the latest version. I think it's safe enough in the new version that we don't need to discuss this much further until it becomes a bigger problem (more libraries, etc).

Sounds good, thanks!

@ilya-biryukov Could you take another look at the patch?

ilya-biryukov added inline comments.May 16 2018, 5:04 AM
clangd/ClangdLSPServer.cpp
274 ↗(On Diff #147032)

NIT: not related to this change, but maybe use std::move(Edits) to avoid extra copies.

clangd/index/SymbolCollector.cpp
127 ↗(On Diff #147032)

The heuristrics that rely on naming style seem too fragile.
More thorough heuristics should do better, but are more complicated.
Maybe we could leave a fixme saying that these can be improved in case we'll run into problems. WDYT?

unittests/clangd/SymbolCollectorTests.cpp
711 ↗(On Diff #147032)

Could you give an intuition on why is this considered private?
We don't filter out those operators from other headers, right? Why are proto headers special?

ioeric updated this revision to Diff 147058.May 16 2018, 5:12 AM
ioeric marked 3 inline comments as done.
  • Merge remote-tracking branch 'origin/master' into proto
  • Addressed review comments.
This revision is now accepted and ready to land.May 16 2018, 5:14 AM
This revision was automatically updated to reflect the committed changes.