This is an archive of the discontinued LLVM Phabricator instance.

[clangd] IncludeCleaner: Add support for IWYU pragma private
ClosedPublic

Authored by kbobyrev on Feb 22 2022, 12:21 AM.

Diff Detail

Event Timeline

kbobyrev created this revision.Feb 22 2022, 12:21 AM
kbobyrev requested review of this revision.Feb 22 2022, 12:21 AM
kbobyrev planned changes to this revision.Feb 22 2022, 12:22 AM

The patch is working but I didn't add tests yet. Also, I think I should move the suffix mapping to CanonicalIncludes structure rather than Headers.

What's the policy this patch intends to implement?

I'm a little concerned by building up maps of filenames with segment limits - it looks like some kind of heuristic/partial matching.

What's the policy this patch intends to implement?

I'm a little concerned by building up maps of filenames with segment limits - it looks like some kind of heuristic/partial matching.

Basically, handling the mappings that. IWYU pragmas provide.

Are you concerned about the limit itself or the way I'm trying to find these headers in general? I was afraid of consuming too much memory, hence the limit; it's not crucial to the implementation, though, I can remove it if your concern is the limit rather than the approach.

kbobyrev updated this revision to Diff 410490.Feb 22 2022, 2:32 AM

Move suffix map to CanonicalIncludes (where it logically belongs), remove
component limit there.

kbobyrev planned changes to this revision.Feb 22 2022, 2:32 AM

What's the policy this patch intends to implement?

I'm a little concerned by building up maps of filenames with segment limits - it looks like some kind of heuristic/partial matching.

Basically, handling the mappings that. IWYU pragmas provide.

Yes, I'm asking you to describe that handling in a couple of sentences so I can understand what you intend it to be.

e.g. for includecleaner in general, the approach is: walk the AST, map each node to decls it uses, map decls to locations, map locations to files, then check which includes point to files that were not identified.

Are you concerned about the limit itself or the way I'm trying to find these headers in general? I was afraid of consuming too much memory, hence the limit; it's not crucial to the implementation, though, I can remove it if your concern is the limit rather than the approach.

Mostly I'm unclear on:

  • whether we're comparing filenames among a small (e.g. the include stack - heuristics OK) or a large set (need to be very precise)
  • whether the partial matching is semantically important or a performance optimitzation
  • whether the concept of "real name" is significant or likely to cause problems
  • whether storing another copy of the names of all the transitive headers is actually necessary for what you're trying to do

What's the policy this patch intends to implement?

I'm a little concerned by building up maps of filenames with segment limits - it looks like some kind of heuristic/partial matching.

Basically, handling the mappings that. IWYU pragmas provide.

Yes, I'm asking you to describe that handling in a couple of sentences so I can understand what you intend it to be.

e.g. for includecleaner in general, the approach is: walk the AST, map each node to decls it uses, map decls to locations, map locations to files, then check which includes point to files that were not identified.

I'm trying to reuse the information about IWYU pragma private that is collected in Canoncalncludes to find the header responsible for including the private ones. Canonicalncludes stores information about the headers with IWYU pragma: prviate, include <XYZ>. When we're figuring out the responsible header in IncludeCleaner, I'm checking if the header has this pragma and then I'm trying to find the <XYZ> header that is mentioned in the pragma comment. For that I'm using a heuristic that the include stack of the main file should have a header whose suffix is <XYZ>. This because we never really have the full header name in the pragma comment and it might not be visible from the private header (and hence can not be resolved from there). I'm not sure if this heuristic will work well in practice, but I couldn't figure out a better one.

An alternative would probably be:

  • Walk #include directives and try to find the "umbrella headers"
  • For each include directive, query the included file and figure out if it has "IWYU pragma: private, include <XYZ>" through CanonicalIncludes
  • If it does, figure out if the supposed umbrella header name has the <XYZ> suffix, in that case attribute the private header to this public one

This all would need to be done after Canonicalncludes are complete, meaning this would probably belong to IncludeStructure instead. Maybe this is more precise, WDYT?

Are you concerned about the limit itself or the way I'm trying to find these headers in general? I was afraid of consuming too much memory, hence the limit; it's not crucial to the implementation, though, I can remove it if your concern is the limit rather than the approach.

Mostly I'm unclear on:

  • whether we're comparing filenames among a small (e.g. the include stack - heuristics OK) or a large set (need to be very precise)

Yes, I'm using just the include stack of the main file and the preamble. When you say "large set", do you mean the whole set of project headers?

The problem here is that for the include stack we would miss the right diagnostic if the public header responsible for the private one is not accessible. Maybe we should throw an "unused" warning either way here, I don't know what would be right.

  • whether the partial matching is semantically important or a performance optimitzation

The matching itself is an optimisation, right: what I'm dealing with is that gtest.h header is called /home/user/llvm-project/llvm/util/googletest/.../gtest/gtest.h but the private headers will only say include "gtest/gtest.h". What I'm trying to do is find the header whose real path name ends with gtest/gtest.h.

  • whether the concept of "real name" is significant or likely to cause problems

Hm, I don't know; what kind of problems do you think might appear?

  • whether storing another copy of the names of all the transitive headers is actually necessary for what you're trying to do

The problem here is that I'll need FileID in headerResponsbile but I can't reuse FileIDs between preamble and main file, right? Logically, I'm afraid I don't have an option to store anything other than the name (or an index to it); I could uplift this logic to the IncludeStructure and loop over all visible headers in there, I wouldn't need another copy since I'm already storing it there. But that might be quite slow.

What's the policy this patch intends to implement?

I'm a little concerned by building up maps of filenames with segment limits - it looks like some kind of heuristic/partial matching.

Basically, handling the mappings that. IWYU pragmas provide.

Yes, I'm asking you to describe that handling in a couple of sentences so I can understand what you intend it to be.

e.g. for includecleaner in general, the approach is: walk the AST, map each node to decls it uses, map decls to locations, map locations to files, then check which includes point to files that were not identified.

I'm trying to reuse the information about IWYU pragma private that is collected in Canoncalncludes to find the header responsible for including the private ones. Canonicalncludes stores information about the headers with IWYU pragma: prviate, include <XYZ>.

This makes sense, though note that CanonicalIncludes predates IncludeStructure and had a pretty narrow purpose, it _might_ make more sense to record/store the data elsewhere, we should consider all options.

When we're figuring out the responsible header in IncludeCleaner, I'm checking if the header has this pragma and then I'm trying to find the <XYZ> header that is mentioned in the pragma comment.
For that I'm using a heuristic that the include stack of the main file should have a header whose suffix is <XYZ>.

So all we *know* is that the header that should be included can be spelled <XYZ>.
We don't know that it's on the include stack for the header defining the symbol, but we also don't know that it's included at all! (This is similar to the stdlib case).

It's true that for "unused include" warning we can ignore it if not included.
Still, it seems like a bad sign that we're trying to model headers referenced in this was as FileIDs rather than e.g. strings.
(And it will be harder to reuse for "missing include" warnings).

This because we never really have the full header name in the pragma comment an?d
it might not be visible from the private header (and hence can not be resolved from there)

Do you have an example? For include insertion purposes, we assume the header name can be included verbatim.
That implies it should be resolvable against the include path of any TU that sees the private header. (That said, performing IO might have its own problems)

An alternative would probably be:

  • Walk #include directives and try to find the "umbrella headers"
  • For each include directive, query the included file and figure out if it has "IWYU pragma: private, include <XYZ>" through CanonicalIncludes
  • If it does, figure out if the supposed umbrella header name has the <XYZ> suffix, in that case attribute the private header to this public one

This all would need to be done after Canonicalncludes are complete, meaning this would probably belong to IncludeStructure instead. Maybe this is more precise, WDYT?

I don't really understand this alternative, because "try to find the umbrella headers" seems to be begging the question.
What is an umbrella header apart from one referenced by an IWYU private pragma, and how would we identify them?

  • whether we're comparing filenames among a small (e.g. the include stack - heuristics OK) or a large set (need to be very precise)

Yes, I'm using just the include stack of the main file and the preamble. When you say "large set", do you mean the whole set of project headers?

Small set = include stack = chain of includes from main file -> private header. (But public header may not be on that stack).
Large set = include graph = all preamble headers.

The problem here is that for the include stack we would miss the right diagnostic if the public header responsible for the private one is not accessible. Maybe we should throw an "unused" warning either way here, I don't know what would be right.

  • whether the partial matching is semantically important or a performance optimitzation

The matching itself is an optimisation, right: what I'm dealing with is that gtest.h header is called /home/user/llvm-project/llvm/util/googletest/.../gtest/gtest.h but the private headers will only say include "gtest/gtest.h". What I'm trying to do is find the header whose real path name ends with gtest/gtest.h.

There's a few issues here:

  • there might be multiple "gtest/gtest.h" (#include_next) though typically only in the stdlib
  • ignoring the include path might lead to an incorrect match (private; include "config.h" should not match <llvm/config.h>)
  • "real" path name *sounds* nice, but may not be the right one

I think the thing we're really trying to express is: #include "gtest/gtest.h" is not unused.
We can loosen that to match #include <gtest/gtest.h>, #include "path/gtest/gtest.h" etc if we want, but ideally we'd do that explicitly rather than as a side-effect of the implementation.

Again, I think the clearest way to express this intent is simply to bubble the string "gtest/gtest.h" all the way up to the "which inclusions are unused" check.

  • whether the concept of "real name" is significant or likely to cause problems

Hm, I don't know; what kind of problems do you think might appear?

  • whether storing another copy of the names of all the transitive headers is actually necessary for what you're trying to do

The problem here is that I'll need FileID in headerResponsbile ut I can't reuse FileIDs between preamble and main file, right?

When we'd discussed this in the past, we'd talked about doing this in HeaderID space instead of FileID space.
headerResponsible() is the right API/point in the code for non-include-guarded headers, but may not be the right fit for pragmas, even if it means a bit of unfortunate duplication.

(Again, I think strings are probably better than HeaderIDs )

kbobyrev updated this revision to Diff 417940.Mar 24 2022, 8:36 AM

Switch to a different model of recording includes that differentiates between
user includes and public headers recored through IWYU pragmas.

Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2022, 8:36 AM
kbobyrev updated this revision to Diff 418155.Mar 25 2022, 1:33 AM

Fix a bug, add tests for diagnostics. This is ready for a review now.

kbobyrev updated this revision to Diff 418159.Mar 25 2022, 2:12 AM

Add some docs, rebase on top of main.

kbobyrev updated this revision to Diff 418160.Mar 25 2022, 2:14 AM

Use better name for header recorder lambda.

Thanks, this is much nicer!

clang-tools-extra/clangd/IncludeCleaner.cpp
310

this signature/hook is a bit hard to understand (and undocumented).

I think understanding at least the idea of an IWYU mapping (if not the syntax) is in scope for this library, and I think it would be clearer to have a function<Optional<StringRef>(FileID)> specifically to look up those mappings instead

(It's fine that clangd uses the existing CanonicalIncludes mapping for now, but I also think when we pull this out as a library, that library will provide some facility for recording these mappings, so the interface here shouldn't be too high-level)

clang-tools-extra/clangd/IncludeCleaner.h
62

Specify whether the spelling includes quotes or not

64

Maybe call this ExactSpellings or SpelledUmbrellas? I think public vs private isn't the right distinction, stdlib headers and user headers are also often public.

clang-tools-extra/clangd/unittests/DiagnosticsTests.cpp
1640

Must we test this here (which is AIUI a smoke test of emitting diagnostics) instead of in the getUnused tests in IncludeCleanerTests?

1690

This doesn't seem strictly correct (I assume it fires even if we don't include public.h).
There's a problem here, but it isn't that the header is unused. And worse, at the moment we can't help you fix this problem, and the suggested change here is likely to break the build.

For unused, I think we probably want to mark both as used?
(For missing, we'd only require the public header)

kbobyrev updated this revision to Diff 419378.Mar 31 2022, 2:47 AM
kbobyrev marked 5 inline comments as done.

Address review comments.

sammccall accepted this revision.Mar 31 2022, 3:46 AM
This revision is now accepted and ready to land.Mar 31 2022, 3:46 AM
This revision was landed with ongoing or failed builds.Mar 31 2022, 3:50 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Mar 31 2022, 8:11 AM

Looks like this breaks tests on Windows: http://45.33.8.238/win/55402/step_9.txt

Please take a look, and revert for now if it takes a while to fix.