This is an archive of the discontinued LLVM Phabricator instance.

[clangd] IncludeCleaner: Mark used headers
ClosedPublic

Authored by kbobyrev on Aug 17 2021, 1:51 AM.

Diff Detail

Event Timeline

kbobyrev created this revision.Aug 17 2021, 1:51 AM
kbobyrev requested review of this revision.Aug 17 2021, 1:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 17 2021, 1:51 AM
kbobyrev edited the summary of this revision. (Show Details)Aug 17 2021, 1:51 AM
kbobyrev planned changes to this revision.Aug 17 2021, 1:54 AM

Need to add some basic tests.

kbobyrev updated this revision to Diff 367140.Aug 18 2021, 1:12 AM

Rebase on top of main.

kbobyrev planned changes to this revision.Aug 18 2021, 1:13 AM

I realize many of the things I'll object to came from my own prototype, sorry about that :-\
I think/hope I gave you some forewarning about this!

clang-tools-extra/clangd/Headers.h
145

Hey, I recognize this code :-)

I think the key ideas here are that:

  • we're using opaque identifiers for the files, nothing is interesting about a file other than its (include) edges and (directly-referenced) color
  • the identity of headers is an impl detail of Headers.h rather than being something like a FileID, this allows us to hide messy details of files not having stable identity across preamble->main file

I think the second idea is important, but the first one might be a bit naive. I worry it's going to lead to certain rules being hard to implement, or being bundled into Headers.cpp instead of IncludeCleaner.

For example:

  • if a file is not self-contained, how does this affect the algorithm? (There's a FIXME for this, but it's in the wrong file!)
  • if a file is a standard library entrypoint?
  • if a file is a standard library impl detail?

I think the facts (is a file self-contained) should be part of IncludeStructure, but that we should expose them for IncludeCleaner to deal with, rather than trying to hide them in markUsed.
This means giving IncludeStructure a wider interface, which is we need to be careful about. It makes it harder to substitute algorithms by swapping out the UsedFunc, but I think this is only a cute trick and not actually important.


Concretely, I think I'd suggest just extending the public API to expose the "file index" concept:

  • expose the type using IncludeStructure::File = unsigned or so
  • add the file id to Inclusion
  • add File getFile(const FileEntry*)
  • add ArrayRef<File> getIncludedFiles(File)
  • in future, we can add e.g. const char* isStandardLibraryEntrypoint(File) or whatever

And implement all of markUsed in IncludeCleaner.
(Not 100% sure if we actually still need the Used ivar in Inclusion, come to think of it, maybe we just run this code in the diagnostic cycle)

clang-tools-extra/clangd/IncludeCleaner.cpp
144

nit: move this to be a line comment on the sort() call?

I think it's sufficiently nonobvious that sorting groups by file ID that it becomes nonobvious exactly what code the comment refers to!

clang-tools-extra/clangd/IncludeCleaner.h
51

so when *do* we perform this expansion?

Seems like you've wired this up end-to-end in this patch and we're just going to hit the elog case.

I think it's reasonable to put this expansion into findReferencedFiles fwiw, it's a fairly simple second pass and in practice combining them won't interfere with reasonable tests

52

comment just says spelling, not spelling/expansion, not sure if this is significant.

Expansion is actually the more obvious, but I do think we need both.

kbobyrev updated this revision to Diff 374174.Sep 22 2021, 3:57 AM
kbobyrev marked 4 inline comments as done.

Improve structure, address review comments.

Hey, sorry for the gigantic turn around. I still need to cover the code with few tests and polish it a bit more but I've updated the majority of it and pushed to get some early feedback before I do that. Please let me know if you have any concerns/see some problems with the approach I went for!

kbobyrev updated this revision to Diff 374219.Sep 22 2021, 7:00 AM

Populate Inclusion.ID, add a test (failing for now).

kbobyrev updated this revision to Diff 374226.Sep 22 2021, 7:19 AM

Make sure FileEntry* is not nullptr

sammccall added inline comments.Sep 22 2021, 11:39 PM
clang-tools-extra/clangd/Headers.h
61

Most includes are part of the preamble, so there are two relevant parse actions (preamble, and mainfile-using-preamble). Each has its own SourceManager and therefore namespace of FileIDs.

There's no rule that says a header gets the same FileID when a preamble is used. As written, RecordHeaders is assigning Inclusion::ID based on the preamble, and then we end up comparing it to FileIDs from compareUnusedIncludes().

From reading the ASTReader code, I *believe* that there's a simple offset between the two: e.g. that if a preamble uses FileIDs from 1-100, then these might be mapped to FileIDs 1501-1600 when that preamble is reused. We could go down the path of exploiting this. (Though we need to investigate the details and think a little about how it works with modules).

The somewhat less-coupled alternative we use today is to use the FileEntry::Name as documented in the private section of IncludeStructure. There are a few ways to build on top of this - basically we're either going to do most calculations in FileID space, or expose a "stable file index" from IncludeStructure and do most calculations in that space...

kbobyrev updated this revision to Diff 374469.Sep 23 2021, 1:23 AM

Perform the computation in the IncludeStructure::File space.

kbobyrev marked an inline comment as done.Sep 23 2021, 1:23 AM
kbobyrev updated this revision to Diff 375153.Sep 26 2021, 11:38 PM

Prepare for rebase: revert Headers.cpp and Headers.h

kbobyrev updated this revision to Diff 375154.Sep 26 2021, 11:50 PM

Rebase on top of D110386.

kbobyrev updated this revision to Diff 376425.Sep 30 2021, 11:21 PM

Rebase on top of main. Now ready for a review.

kbobyrev updated this revision to Diff 376430.Oct 1 2021, 12:00 AM

Fix the rebase

kbobyrev updated this revision to Diff 376431.Oct 1 2021, 12:06 AM

Tiny refactoring.

kbobyrev updated this revision to Diff 377112.Oct 5 2021, 1:22 AM

Rebase on top of landed patches.

Ping, @sammccall

Sorry, I thought i'd sent these comments...

clang-tools-extra/clangd/IncludeCleaner.cpp
158

Why are we passing around Inclusions by value?

clang-tools-extra/clangd/IncludeCleaner.h
51

This says FIXME but IIUC it's fixed.

56

this function is undocumented, unused and untested :-)

What's it for? Why does it not return a set?

kbobyrev updated this revision to Diff 377125.Oct 5 2021, 2:26 AM
kbobyrev marked 3 inline comments as done.

Resolve review comments.

kbobyrev updated this revision to Diff 377158.Oct 5 2021, 5:14 AM

Refactor FileID -> IncludeStructure::HeaderID into a separate function.

sammccall added inline comments.Oct 5 2021, 5:47 AM
clang-tools-extra/clangd/IncludeCleaner.cpp
158

Sorry, should have thought this through more before leaving the comment. There are a couple of questions really:

How should we *store* the information about which inclusions are unused?

  • not at all, generate ReferencedFiles and compute "is this header unused" on the fly when generating diagnostics
  • store ReferencedFiles but call "is this header unused" on the fly
  • store a boolean or something in each Inclusion
  • store a list of the inclusions that are unused

IMO this is mostly a question of what's the lifecycle of the info, and what's the simplest representation - seems like we should prefer stuff higher on the list if we have a choice.

What should the signature of the function be?
There doesn't seem to be any work saved here by processing all the includes as a batch - why not simplify by just making this bool isUnused(...) and let the caller/test choose what data structures to use?

159

EntryPoint is unused

166

Handle unresolved case somehow? (Or assert if you're sure it can't happen - I think it can for e.g. pp_file_not_found)

166

doesn't seem like there's any need to go through filenames for this.
Can't we just store the HeaderID in the Inclusion?
(Blech, as an unsigned to avoid a circular dependency)

168

elog says that:
a) this might happen
b) logging for the user is the best thing we can do

Can this actually happen? My suspicion is no. In which case maybe it should be an assert?

173

I'm fairly (more) certain this one should be an assert

193

assert?

196

this is get, not getOrCreate, so you don't need the mutable reference

198

I'm not totally sure whether this is safe to assert or not. WDYT?

In any case, please fix the message (FE -> HeaderID, add more context)

sammccall added inline comments.Oct 5 2021, 6:11 AM
clang-tools-extra/clangd/IncludeCleaner.cpp
158

OK I was confused, nothing is getting stored, but ParsedAST::getUnused() function creates/destroys the analysis data so it needs to run as a batch.

I think probably:

  • *this* function should just be a simple bool isUnused(...) and the loop lives in the caller
  • ParsedAST::getUnused() should become getUnused(const ParsedAST&) and live in this file

We have some circularity between ParsedAST and IncludeCleaner, but I think we're going to have that in any case due to findReferencedLocations()

kbobyrev updated this revision to Diff 377197.Oct 5 2021, 6:44 AM
kbobyrev marked 10 inline comments as done.

Address review comments.

sammccall accepted this revision.Oct 5 2021, 8:20 AM
sammccall added inline comments.
clang-tools-extra/clangd/Headers.h
61

I don't think we're under any size pressure here - Optional<unsigned>?

61

call the member HeaderID, rather than have this as a comment only?

clang-tools-extra/clangd/IncludeCleaner.cpp
161

SM is unused

165

this can just be a check whether MFI.ID is valid or not

171

ReferencedFiles.contains(IncludeID) and inline into the if?

clang-tools-extra/clangd/ParsedAST.h
159

I think this function belongs in IncludeCleaner.h.

(There's a circularity problem between ParsedAST and IncludeCleaner, but putting the function here doesn't fix it)

clang-tools-extra/clangd/unittests/IncludeCleanerTests.cpp
165

Don't bother doing this stripping just for the test IMO, it obscures the assertion (more than escaping quotes would)

This revision is now accepted and ready to land.Oct 5 2021, 8:20 AM
kbobyrev updated this revision to Diff 377264.Oct 5 2021, 8:38 AM
kbobyrev marked 6 inline comments as done.

Thank you for the review! Looks much better now.

This revision was automatically updated to reflect the committed changes.
kbobyrev marked an inline comment as done.Oct 5 2021, 9:43 AM