This is an archive of the discontinued LLVM Phabricator instance.

[include-cleaner] Ranking of providers based on hints
ClosedPublic

Authored by kadircet on Dec 13 2022, 1:52 AM.

Details

Summary

Introduce signals to rank providers of a symbol.

Diff Detail

Event Timeline

kadircet created this revision.Dec 13 2022, 1:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2022, 1:52 AM
Herald added a subscriber: mgrang. · View Herald Transcript
kadircet requested review of this revision.Dec 13 2022, 1:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2022, 1:52 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

I haven't read all code yet, left a few initial comments.

clang-tools-extra/include-cleaner/lib/AnalysisInternal.h
62

IIUC, we're using a union bit for symbols and headers: some properties are for both (e.g. Complete); some properties are just for headers (e.g. Public, Canonical).

I found that this model is hard to follow (e.g. for each property, I need to keep thinking about whether this property is for just symbols/headers or both, the hint conversion step from Symbol to Header is not obvious).

What do you think about the following alternative ?

  • define two hints (SymbolHint, HeaderHint)
  • an explicit conversion from SymbolHint to HeaderHint
  • SymbolLocation class has a SymbolHint member (for Header as well)
66

nit: since this is a bit union, I'd suggest to use = 1 << 0, = 1 << 1 syntax to define the value.

67

I think this is no non-private declaration. The Public property is for headers, it indicates the symbol is provided by a non-private header.

70

In most cases, this bit is set. If we use a flipped bit (Private), we don't need to set it in many places.

72

This comment is hard to follow, it doesn't explain the Canonical. From the design doc, it means NameMatch and Annotated headers, I think it would be nice to document them in the comment.

I'd suggest avoid using the Canonical name, it remains me too much about the canonical decl in clang. Maybe Preferred?

clang-tools-extra/include-cleaner/lib/FindHeaders.cpp
1

any reason to change the file property?

109

I think Exporter is the case we need to handle -- the exporter header should have a higher value (we probably need to add another hint bit).

112

if FE is private, do we want the Verbatim set the public bit? My inclination is yes (the current code behavior is no), it feels counterintuitive that headers from getPublic() is not marked as Public.

clang-tools-extra/include-cleaner/lib/LocateSymbol.cpp
22–40

nit: declHints?

clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp
179

I think we should have some tests for the hints bit (verify the hint bits are set correctly).

284

The result is correct according to the (canonical, public, complete) triple order.

But we can come up with a counterexample, e.g.

// transitive.h
#include "public_complete.h"

// main.cpp
#include "transitive.h"
Foo foo;

We will remove the transitive.h and insert the foo.proto.h, then we cause a compiling error (incomplete type of Foo). Not sure what we should do about it.

sammccall added inline comments.Dec 14 2022, 8:01 AM
clang-tools-extra/include-cleaner/lib/AnalysisInternal.h
61

along with SymbolLocation, I think mixing our fundamental types and analysis functions in this header makes the structure harder to understand and think they both belong in Types.h

62

I agree the model isn't obvious, but I think it is the right one and should be explained.
Breaking into a bunch of different types would be a mistake IMO. It would make it look less like algebra and more like domain logic. The problem is the algebra is important, e.g. it tells you how to merge hints to rank providers.

I think the model is:

  • we have a "provides" DAG, headers provide locations which provide symbols which satisfy AST nodes
  • these hints are "capabilities" that can be provided by edges (or subpaths).
  • if any path from Header => AST Node sees a capability, the header provides that capability.
  • this explains why most hints are expressed positively, and we take the union. (if a header provides a complete definition and an incomplete one, we want to call that "complete")
64

As discussed elsewhere, we could define this as "provides a generally-usable definition for a symbol", i.e. clear it for template/class forward-decls and set it for everything else. This isn't really more code, and makes it less likely we're going to subtly misinterpret the value.

67

I agree with this one: because hints aggregate as union, it would be a mistake to flag a Symbol as Public - that would mean any header that provided it would end up with the Public hint, even if the header file is private!

I wonder if we should use a naming-convention for this, e.g. CompleteSymbol, PublicHeader etc, indicating the layer at which the signal is expected to be set.

72

+1 to Preferred and also documenting the conditions when a signal is set here in some detail (also for Public and Complete).

I think we really have to understand what the signal is quite precisely to make correct use of it.

74–75

missed this in the previous review: both locateSymbol and headersForSymbol should probably be above writeHTMLReport

76

tarversing -> traversing
sorted per -> sorted by

(uber-nit: i.e. should be e.g. or simply deleted, as "top element first" is not equivalent to "list is sorted")

80

std::forward should just be std::move here, T is not a reference type

clang-tools-extra/include-cleaner/lib/FindHeaders.cpp
44

I would suggest making this function public (or operator< on Hinted<Header>) and unit-testing it to demonstrate the intended behavior.

The implementation is trivial now but may not always be, and editing enum constants at the end rather than in the correct position seems like an easy mistake to make.

53

doing string rendering in a comparator seems gratuitously wasteful.

call stable_sort instead? the input order was used prior to this patch and seems OK.
(or define a meaningful operator< on Header and call that, but it need not be done now)

62

again, this seems unneccesary - switch over the variants instead?

103

why never canonical?

In general, handling !PI as a special case up front seems like a bad idea: the purpose of supporting it being nullable is to allow the code below to used/tested without one. Seems we can get away with guarding PI-bits with if (PI) below.
(Avoiding the ternary initializer for CurrentHints would be a readability side-benefit!)

109

I think we discussed and thought that was a bit unsafe. export can simply mean "downstream code can assumes this symbol is available", not "i'm the real owner of these symbols".

Prominent example: https://github.com/protocolbuffers/protobuf/blob/14e579436fb53d0e8271c4d26d3d88c394d96454/src/google/protobuf/stubs/port.h#L90
(it's not relevant that the exported header in question is a standardish one)

112

+1, conceptually this is a reference to a sibling edge in the DAG, not an edge chaining off this one, so it should not inherit hints

183

why not define operator< on Header if you need one? this will come up again.

188

not sure if this is really a FIXME: it's a pretty deep consequence of the design that will never be fixed.

Since you're relying on the distinction between "header" and "Header" here, maybe be a bit more explicit? "If two Headers probably refer to the same file (e.g. Verbatim(foo.h) and Physical(/path/to/foo.h), we won't deduplicate them or merge their hints".

209

again, no need to stringify and then reparse here (in fact operator<< for Symbol is extra-inefficient, assuming that we never call it in algorithms), we have a perfectly good underlying string that's pretty easy to get at in the cases where matching is appropriate

213

BTW This stringify-and-parse isn't quite correct, e.g. for operator std::string it yields operator string but for operator typename T::foo() it yields foo!

(I realize we do this in clangd, which isn't great...)

clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp
179

these tests seem to be in the wrong file

179

yes, I think this will be easier to maintain & make complete if you test the hints extracted, test the sort order, and only smoke-test the interaction between the two

281

This is confusing, the most obvious interpretation is "this is the generated header for the .proto file that defines foo", but such a header would have a complete definition.

I can't think of a remotely realistic proto example that actually triggers this case, so maybe drop the references to proto? It seems to give some false intuitions about whether this behavior is correct.

284

yeah, this is one of the tradeoffs of not examining completeness requirements at use sites.

However it's also a reason to believe that operator< on Hints may not be a simple lexicographic compare, several "lesser" hints may be able to overpower a "greater" one ==> another reason to make the comparison public & test it.

298

This comment describes an implementation (IMO is unlikely to be the right one) to solve a problem (which is left implicit) - it would be nice to state the problem instead.

I think we should have some special handling for the main-file, but it's not clear that boosting it is the way to go.

For example, if the main-file provides a forward-decl for a class, we probably shouldn't consider any other decl of that class used. (The forward decl clearly signals that it doesn't intend to rely on some external decl). So maybe headersForSymbol should only return the main file, or return some sentinel "provided in main file" value instead of a list.

kadircet marked 26 inline comments as done.Jan 10 2023, 1:58 AM
kadircet added inline comments.
clang-tools-extra/include-cleaner/lib/AnalysisInternal.h
61

i still don't feel great about exposing these types as public API. would it help if I moved these into a TypesInternal.h ?

62

Yes, as discussed offline and Sam mentions in the comment idea is to have all hints expressed positively and take union of all the capabilities on the path from an AST node to a provider (also union in case there are multiple paths ending at the same provider). Each hint is to be provided by at most a single stage of the traversal (not sure why you think Complete is for both), I'll make this more concrete by renaming hint types to encode that in the name. I'll also add some docs explaining the model.

67

i agree with renaming these to indicate the stage that produced them.

70

as discussed, these need to be positively expressed (like others).

clang-tools-extra/include-cleaner/lib/FindHeaders.cpp
103

in the absence of PI, we can't really do much in the loop below (as we would assume each file is self-contained by default) and would bail out in the first run. Hence i don't see much difference between handling it here vs right after the calculation of CurrentHints down below (I am slightly inclined towards this version as it prevents thinking about both cases in the loop for no reason). I can move it into the body if you feel strongly about it though.

regarding canonicalness, right now it's inferred either by namematch (which is calculated in headersForSymbol) or via IWYU pragmas, which isn't available in absence of PI.

109

agreed, exports shouldn't really be boosted.

112

oops you're right. this was an oversight. same applies to exporters.

213

i think using the identifier name should be fine here, operators will be unnamed and won't get any name match.

clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp
179

these tests seem to be in the wrong file

I've chosen this one as the implementation lives in FindHeaders.cpp, do you think it belongs to AnalysisTests instead?

284

in addition to what Sam said, note that we'll preserve public_complete.h if it's already included directly.

kadircet updated this revision to Diff 487721.Jan 10 2023, 1:58 AM
kadircet marked 5 inline comments as done.
  • Address all comments but the ones on tests.
sammccall accepted this revision.Jan 19 2023, 7:27 AM

LG, just nits really!

clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h
137

this is a bit confusing... kind of the opposite of what std::in_place means?
(All of the other constructors build the state in-place in some sense, but this one doesn't)

Maybe just a private struct Sentinel{}; instead of in_place_t?

Symbol does this without any sentinel at all, if that's causing ambiguity then it might be worth fixing both to be the same.

clang-tools-extra/include-cleaner/lib/AnalysisInternal.h
61

AnalysisInternal < TypesInternal < Types I think, so yes

(I think separating the concepts/data structures from the algorithm/implementation is the important concern here. I don't think defensively minimizing surface area is important as the library simply won't have many users, and that exposing these is a net win because it makes the structure of the system easier for human readers to understand. Anyway, up to you)

63

nit: touched => traversed? Since you haven't explicitly mentioned a graph here (which is fine), hint at it more specifically

68

"positive" makes intuitive sense, but not *quite* the right concept. (existential vs universal quantification)

Maybe just give an example:

We choose the boolean sense accordingly, e.g. "Public" rather than "Private",
because a header is good if it provides any public definition, even if it also provides private ones.
71

nit: maybe Hints over Hint to indicate it's a bitset

72

nit: the conditions seem well-documented now, but I think we should drop the e.g. these should be the actual definitions, not just examples.

87

nit: augment values with hints? (or "augment types with Hint", but I think values is clearer)

clang-tools-extra/include-cleaner/lib/FindHeaders.cpp
37

just stable_sort(reverse(Headers))?

(reverse() returns an rbegin/rend view which should DTRT)

45

this signature seems unneccesarily complicated/hard to understand

what about nameMatch(DeclName, Header) -> Hint (or ->bool) and write the trivial loop in the caller?

59

debug stuff, and below

60

even though it's a few lines, if you pull out StringRef basename(StringRef Header) this code looks a lot clearer (and if returning bool, directly return basename(H.physical()->getName()).equals_insensitive(DeclName) etc inline above)

127–134

headers() returns the alternatives in preferred order, so you only want to set PreferredHeader on the first.

146

Somewhere say what you're doing: deduplicating headers and merging their hints

clang-tools-extra/include-cleaner/lib/Types.cpp
115

I think comparing pointers is going to give inconsistent results, and would suggest comparing Name (without trying to normalize, just for at least reproducibility across identical runs)

clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp
179

Nope, I think I was just confused, sorry

274

I don't think it does, it looks like you're sorting by the FileEntry pointers - this may happen to be discovery order a lot of the time but it's of course not guaranteed.

This revision is now accepted and ready to land.Jan 19 2023, 7:27 AM
kadircet updated this revision to Diff 491286.Jan 23 2023, 3:39 AM
kadircet marked 23 inline comments as done.
  • Address comments & rebase
clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h
137

without a disambiguation tag we get an ambigious conversion when we have a constructor call that looks like Header("foo.h") (whether to use Header(StringRef) or Header(variant<StringRef>), not sure if it's fine for an "inaccessible" constructor to participate in overload resolution, nor how variant can be viable here despite needing a double implicit conversion from stringref to variant)

This didn't bite us with Symbol yet because constructor takes a Decl& but variant stores a Decl*, and for Macro case we always explicitly initialized with spelling of Macro.

You're right about usage of inplace_t though, I was confused. using a sentinel tag instead.

clang-tools-extra/include-cleaner/lib/Types.cpp
115

i was rather worried about correctness of a name based match, e.g. we can have the same FileEntry pointers for foo.h and bar/foo.h but i guess it's OK to assume that a FileManager won't be accessed in the middle of a comparison sequence.

This revision was automatically updated to reflect the committed changes.