This uses heuristics to identify private proto symbols. For example,
top-level symbols whose name contains "_" are considered private. These symbols
are not expected to be used by users.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
clangd/index/SymbolCollector.cpp | ||
---|---|---|
95 ↗ | (On Diff #146313) | NIT: reduce nesting by inverting if condition (see LLVM style guide) |
155 ↗ | (On Diff #146313) | Maybe add a comment mentioning that we remove internal symbols from protobuf code generator here? |
156 ↗ | (On Diff #146313) | We should also run the same code to filter our those decls coming from sema completions. |
unittests/clangd/SymbolCollectorTests.cpp | ||
713 ↗ | (On Diff #146313) | Maybe also test that the same symbol is not excluded in the file that does not end with .proto.h? |
Can there be an option for this? This seems very library specific and could break other code bases. Ideally, there would be a generic mechanism for this kind of filtering, i.e. specify a pattern of excluded files or symbol names. But I understand this would be cumbersome because you want to filter only *some* symbol names in *some* files, so it would be difficult for users to specify this intersection of conditions on command-line arguments, for example. I think this needs to be discussed a bit more or have this turned off by default (with an option to turn on!) until there is a more general solution for this kind of filtering.
Having user-configurable filtering may certainly be useful, but requires some design to get right.
And even if we have it, I think there's value in automatically handling popular frameworks, unless we know it might break other code in practice.
E.g., for protobuf, we know that generated headers always end with .proto.h. We could also check for comments that proto compiler tends to put into the generated files to be sure. If we do that, I feel it's better to have the filtering for protos enabled by default, since there's almost zero chance that people had a file that ends with .proto.h and put a proto compiler comment into it.
But even the .proto.h ending seems like a good enough indication.
So, the first line of the file generated by proto compiler seems to be something like this:
// Generated by the protocol buffer compiler. DO NOT EDIT!
If we check the symbol comes from a file with this comment, there will be zero chance that we guess it wrong.
And we can always filter, in addition to the user-provided filters that @malaperle proposed (which also sound like a very useful feature to me!)
@ioeric, @malaperle, @sammccall, WDYT?
Here :)
http://www.sidefx.com/docs/hdk/_s_o_p___bone_capture_lines_8proto_8h_source.html
I feel it's better to have the filtering for protos enabled by default
I like the idea that things work "out of the box", we have to make sure that it doesn't make it buggy for certain code bases and impossible to work around.
I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)
If handling for other libraries is added later it would be good to split out this code a bit. A collection of "filters" could be passed to symbol collector. Each filter/framework-handling could be in it's own source file. Later...
Thanks for sharing the example Marc! It's a bit surprising to see files that are not protobuf-generated named proto.h.
I'm not a big fan of parsing file comment in proto. It seems a bit cumbersome and we might not be able (or too expensive) to do so for completion results from sema (if we do want to filter at completion time).
Pattern-based filtering could be an option as it wouldn't require code modification and could support potentially more filters, although I'm a bit worries about rules getting too complicated (e.g. filters on symbol kinds etc) or running into limitation.
But for now, it seems to me that the easiest approach is putting an option around proto heuristic for the symbol collector, until we need more filters.
clangd/index/SymbolCollector.cpp | ||
---|---|---|
156 ↗ | (On Diff #146313) | Just want to clarify before going further. IIUC, in *index-based* completion, the preamble could still contain symbols from headers such that sema completion could still give you symbols from headers. If we do need to build the filter into code completion, we would need more careful design as code completion code path is more latency sensitive. |
unittests/clangd/SymbolCollectorTests.cpp | ||
713 ↗ | (On Diff #146313) | Sounds good. |
Here :)
http://www.sidefx.com/docs/hdk/_s_o_p___bone_capture_lines_8proto_8h_source.html
Didn't take along to find an example. Thanks for digging this up. That looks like a good enough reason to not apply proto-specific filtering based solely on the filename...
I like the idea that things work "out of the box", we have to make sure that it doesn't make it buggy for certain code bases and impossible to work around.
Totally agree, we should only enable something that we all agree is good enough at detecting the frameworks that it never guesses wrong on real code.
If handling for other libraries is added later it would be good to split out this code a bit. A collection of "filters" could be passed to symbol collector. Each filter/framework-handling could be in it's own source file. Later...
That LG, I guess we can iterate on the design in a CL, design doc or an email thread. However, it's outside the scope of this patch probably.
I'm not a big fan of parsing file comment in proto. It seems a bit cumbersome and we might not be able (or too expensive) to do so for completion results from sema (if we do want to filter at completion time).
Why do you feel it's cubersome? Getting the first line of the file from SourceManager and looking at it seems easy.
I certainly don't see why this should be expensive if we do it at the right time (perhaps it means doing that when building the preamble and stashing the results alongside it, but that's also easy in our current setup).
clangd/index/SymbolCollector.cpp | ||
---|---|---|
156 ↗ | (On Diff #146313) | For reference. As discussed outside this thread, we might have decls from headers in the sema completion items. |
I think this is good if that's true that the comment is always there. I think it would be OK for this to be enabled by default, with a general option to turn heuristics off. Not sure what to call it... -use-symbol-filtering-heuristics :)
@malaperle Having an option for filtering heuristics seems a bit confusing. We have other filters in the symbol collector that we think could improve user experience, and we don't provide options for those. Similarly, for proto headers, I think we could also get away without such an option if we strike for low/no false positive (e.g. correctly identify proto headers). If folks run into problems with the filter, we would like to understand the use cases and improve the filters. In general, we think the proto filter, when it works, would improve user experience.
I'm not a big fan of parsing file comment in proto. It seems a bit cumbersome and we might not be able (or too expensive) to do so for completion results from sema (if we do want to filter at completion time).
Why do you feel it's cubersome? Getting the first line of the file from SourceManager and looking at it seems easy.
I certainly don't see why this should be expensive if we do it at the right time (perhaps it means doing that when building the preamble and stashing the results alongside it, but that's also easy in our current setup).
@ilya-biryukov You are right, getting a working solution seems easy enough. I was more concerned about finding a design that doesn't intrude completion workflow with library specific logic. But this makes more sense now that we are not doing the filtering on code completion workflow.
clangd/index/SymbolCollector.cpp | ||
---|---|---|
156 ↗ | (On Diff #146313) | After further discussion, we agreed that filtering on code completion results may not be the right approach. For example, it would break assumptions in code completion workflow e.g. limit of completion results from index. Alternatively, we could push filtering to the symbol source level. Currently, we have two sources of symbols: sema and index, and both of them can gather symbols from headers, which are then merged in code completion. With this setup, we would need to apply filters on both sources if we want to do any filtering on header symbols. One solution (for index-based completion) is to make sema only collect symbols in main file (just for code completion) and make indexer index headers (current behavior), where we would only need to filter on index. This doesn't address problem for sema-only code completion, but we think it's not a priority to strike for feature parity between sema-based completion and index-based completion, which we don't really have at this point. So to proceed:
|
@malaperle to expand on what Eric said, adding proto hacks with false positives and no way to turn them off is indeed not the way to go here!
There's probably going to be other places we want to filter symbols too, and it should probably be extensible/customizable in some way.
We don't yet have enough examples to know what the structure should be (something regex based, a code-plugin system based on Registry, or something in between), thus the simplest/least invasive option for now (it's important for our internal rollout to have *some* mitigation in place).
@ioeric can you add a comment near the proto-filtering stuff indicating we should work out how to make this extensible?
What others filters do you mean? If you mean skipping "members", symbols in main files, etc, I a working on making them not skipped, see D44954.
I meant the filters in https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/index/SymbolCollector.cpp#L93 e.g. filtering symbols in anonymous namespace, which we think should never appear in the index.
I think members are more interesting than the private proto symbols. We want to un-filter members because there are features that would use them, so indexing them makes sense. But private proto symbols should never be shown to users (e.g. in code completion or workspaceSymbols).
I also think adding an option for indexing members would actually make more sense because they might significantly increase the index size, and it would be good to have options to disable it if users don't use members (e.g. including members can increase size of our internal global index service, and we are not sure if we are ready for that).
I agree with all of that. What I don't quite understand is why a flag is not ok? Just a fail-safe switch in the mean time? You can even leave it on by default so your internal service is not affected. We know for a fact that some code bases like Houdini won't work with this, at least there will be an option to make it work.
I'll be looking at adding them too. For workspaceSymbols it's useful to be able to find them and matches what we had before. But completion will not pull them from the index.
I think members are more interesting than the private proto symbols. We want to un-filter members because there are features that would use them, so indexing them makes sense. But private proto symbols should never be shown to users (e.g. in code completion or workspaceSymbols).
I also think adding an option for indexing members would actually make more sense because they might significantly increase the index size, and it would be good to have options to disable it if users don't use members (e.g. including members can increase size of our internal global index service, and we are not sure if we are ready for that).
It sounds like we'll need both flags. We should discuss that because I'm planning to add even more (almost all?) symbols. I don't think it's common that users won't want members for workspaceSymbols though, but I see how this is not good for the internal indexing service.
I think a flag doesn't solve much of the problem, and adds new ones:
- users have to find the flag, and work out how to turn it on in their editor, and (other than embedders) few will bother. And each flag hurts the usability of all the other flags.
- if this heuristic is usable only sometimes, that's at codebase granularity, not user granularity. Flags don't work that way. (Static index currently has this problem...)
- these flags end up in config files, so if we later remove the flag we'll *completely* break clangd for such users
We know for a fact that some code bases like Houdini won't work with this, at least there will be an option to make it work.
Is this still the case after the last revision (with the comment check?)
Agree we should only hardwire this on if we are confident that false positives are vanishingly small.
clangd/index/SymbolCollector.cpp | ||
---|---|---|
101 ↗ | (On Diff #146770) | We're going to end up calling this code on every decl/def we see. |
112 ↗ | (On Diff #146770) | this asserts if the name is not a simple identifier (Maybe operators or something will trigger this?). |
I don't really agree with those points, but...
We know for a fact that some code bases like Houdini won't work with this, at least there will be an option to make it work.
Is this still the case after the last revision (with the comment check?)
Agree we should only hardwire this on if we are confident that false positives are vanishingly small.
...I hadn't noticed the latest version. I think it's safe enough in the new version that we don't need to discuss this much further until it becomes a bigger problem (more libraries, etc).
- Address review comments.
clangd/index/SymbolCollector.cpp | ||
---|---|---|
101 ↗ | (On Diff #146770) | s/Symbol/Decl/ We could store a cache in the symbol collector (just need to add another state in the class, remember to invalidate for a new ASTContext, make this a member etc), but I think the matching is cheap enough? |
112 ↗ | (On Diff #146770) | Good catch! |
...I hadn't noticed the latest version. I think it's safe enough in the new version that we don't need to discuss this much further until it becomes a bigger problem (more libraries, etc).
Sounds good, thanks!
@ilya-biryukov Could you take another look at the patch?
clangd/ClangdLSPServer.cpp | ||
---|---|---|
274 ↗ | (On Diff #147032) | NIT: not related to this change, but maybe use std::move(Edits) to avoid extra copies. |
clangd/index/SymbolCollector.cpp | ||
127 ↗ | (On Diff #147032) | The heuristrics that rely on naming style seem too fragile. |
unittests/clangd/SymbolCollectorTests.cpp | ||
711 ↗ | (On Diff #147032) | Could you give an intuition on why is this considered private? |