This is an archive of the discontinued LLVM Phabricator instance.

[clang-doc] Implement reducer portion of the frontend framework
ClosedPublic

Authored by juliehockett on Feb 15 2018, 10:15 AM.

Details

Summary

Implements a simple, in-memory reducer for the mapped output of the initial tool. This creates a collection object for storing the deduplicated infos on each declaration, and populates that from the mapper output. The collection object is serialized to LLVM bitstream. On reading each serialized output, it checks to see if a merge is necessary and if so, merges the new info with the existing info (prefering the existing one if conflicts exist). Also implements the bitcode reader.

For a more detailed overview of the tool, see the design document on the mailing list: RFC: clang-doc proposal

Diff Detail

Repository
rL LLVM

Event Timeline

juliehockett created this revision.Feb 15 2018, 10:15 AM
juliehockett edited the summary of this revision. (Show Details)

Moving the entire implementation of the bitcode reader to this patch (from the mapper patch, here) and cleaning up implementation

Updating for parent diff changes

Please run Clang-format and Clang-tidy modernize.

clang-doc/BitcodeReader.h
36 ↗(On Diff #135520)

Please use = default;

lebedev.ri set the repository for this revision to rCTE Clang Tools Extra.
lebedev.ri removed a subscriber: lebedev.ri.
juliehockett marked an inline comment as done.

Cleaning up some and updating based on changes to the parent diff.

Updating for parent diff & refactoring reader.

Thank you for working on this!

Finally trying to review this...
I must say i'm really not fond of the test 'changes'.

But some initial comments added:

clang-doc/BitcodeReader.cpp
19 ↗(On Diff #136793)

I think all these SmallString can be one llvm::SmallVectorImpl<char>?

454 ↗(On Diff #136793)

RecordNames is basically not kept between the reuses of ClangDocBitcodeReader, but is also not actually properly gc'd.
Also, this uses BI_FIRST / BI_LAST, which means the versioning is actually absolutely required...
Also, RecordNames isn't actually used anywhere later in this code.

So, Is that needed for the next patches? Or why read this at all?

clang-doc/BitcodeReader.h
36 ↗(On Diff #136793)

Bits is not an output parameter, but just a const input parameter?
I think it would be cleaner to use StringRef, to avoid depending on the actual size of the small-size.

46 ↗(On Diff #136793)

Please separate those tree functions and this template function with one newline.

48 ↗(On Diff #136793)

T &&I looks super vague.
T is {something}Info, which is inherited from Info base-class.
Maybe something like InfoT &&I ?

clang-doc/Reducer.cpp
19 ↗(On Diff #136793)

As far as i can tell, Reader isn't required to be kept out here.
It seems it's not used to retain any internal state.
The only obvious reason to keep it, is to avoid de-allocating and then re-allocating the memory each time.

I'm wondering if it would be better to move it into the lambda.
Also, why is it prefixed with the doc::? We are in that namespace already.

Then, you will also be able to get rid of llvm::BitstreamCursor &Stream params in the ClangDocBitcodeReader member functions, like with ClangDocBitcodeWriter.

clang-doc/Representation.h
193 ↗(On Diff #136793)

Similarly, i think those should be SmallVectorImpl (i assume those are output params, too?)

test/clang-doc/mapper-class-in-class.cpp
1 ↗(On Diff #136793)

What's up with all tests being deleted? That is rather disturbing.
I would expect the merge phase to be disable-able, if only just for the testing purposes.

juliehockett marked 8 inline comments as done.

Adding in support for mapper tests and addressing comments.

juliehockett added inline comments.Mar 6 2018, 12:25 PM
clang-doc/BitcodeReader.cpp
19 ↗(On Diff #136793)

No, since there's not an implicit converter from llvm::SmallVectorImpl<char> to StringRef. I templatized it on the size though, so it's only one function now.

clang-doc/Representation.h
193 ↗(On Diff #136793)

Yup -- the pointer inside gets set.

Once the base differential firmly lands, could you please rebase this so the review could continue?

Rebasing and updating.

It's good to finally have the initial block firmly landed, congratulations.

Trying to review this...
Some initial thoughts.

clang-doc/BitcodeReader.cpp
19 ↗(On Diff #136793)

Are you sure?
https://godbolt.org/g/wD1FKD

That isn't pretty, but i think it beats templating in this case..

27 ↗(On Diff #139644)

Ok, i don't understand what is going on here.
Where is this E defined?
This looks like [0] is the number of elements to read (always 20, sha1 blob?),
and it copies Record[1]..Record[20] to Field[0]..Field[19] ?
I think this can/should be rewritten clearer..

clang-doc/BitcodeReader.h
33 ↗(On Diff #139644)

Similarly, i think this should be

bool readBitstream(InfoSet &IS);
44 ↗(On Diff #139644)

As far as i can see this should be

template <typename TInfo>
bool readBlockToInfo(unsigned ID, InfoSet &IS);
clang-doc/BitcodeWriter.h
142 ↗(On Diff #139644)

And here too, it does not make much sense to call it with empty std::unique_ptr<>, so maybe

void emitInfoSet(InfoSet &ISet);

?

clang-doc/Reducer.cpp
18 ↗(On Diff #139644)

I can see that you may return nullptr; in case of error here, thus it's std::unique_ptr<>
InfoSet internally is just a few std::vector<>s, so it should std::move() efficiently.
I'm wondering if llvm::Optional<InfoSet> would work too?

clang-doc/Representation.cpp
19 ↗(On Diff #139644)

I though this was changed, and the non-stringified, original binary version of sha1 was emitted into bitcode?

28 ↗(On Diff #139644)

???
Where *to* is it moved?

40 ↗(On Diff #139644)

All these mergeInfo(): the second param, Info &R should probably be Info &&R.
(but not mergeInfoBase()/mergeSymbolInfoBase())

98 ↗(On Diff #139644)

Seeing how many std::move()'ing is happening in those mergeInfo(), i think you want to move I, not pass as reference.
Especially since it is already &&I in this InfoSet::insert() function.

clang-doc/Representation.h
30 ↗(On Diff #139644)

Please add a comment explaining that SymbolID is sha1, and not hex-string of sha1.

135 ↗(On Diff #139644)

Why is the move constructor explicitly defined?
Unless i'm missing something, the default one would do exactly the same.
https://godbolt.org/g/XqsRuX

clang-doc/tool/ClangDocMain.cpp
130 ↗(On Diff #139644)

That is the same small-size used in static std::string serialize(T &I).
Some statistic is needed, bu i think this one can be bumped somewhat right away.

141 ↗(On Diff #139644)

This shares some code with if(DumpMapperResult).
Perhaps you could refactor it slightly, to reduce code duplication?

juliehockett marked 13 inline comments as done.

Addressing comments

Please mention new tool in Release Notes and use :doc: to refer to its manual.

It's surprisingly difficult to review this :/
Just by itself the logic is simple, but the amount of blocks/records/record types is kinda demoralizing.
I really hope it will be possible to somehow tablegen this later.

clang-doc/BitcodeReader.cpp
363 ↗(On Diff #139869)

Hmm, why does this function exist?
The only difference from the previous one is the std::unique_ptr<>.

488 ↗(On Diff #139869)

This signature should be in one place (header?), and then just used here and in ClangDocBitcodeWriter::emitHeader()

clang-doc/BitcodeWriter.cpp
537 ↗(On Diff #139869)

It should be void ClangDocBitcodeWriter::emitInfoSet(const InfoSet &ISet) { then.

clang-doc/BitcodeWriter.h
129 ↗(On Diff #139869)

Isn't that the opposite of what was that assert supposed to do?
It would have been better to just // disable it, and add a FIXME note.

clang-doc/tool/ClangDocMain.cpp
138 ↗(On Diff #139869)

4096?

juliehockett marked 4 inline comments as done.

Small fixes to address comments

juliehockett added inline comments.Mar 27 2018, 12:12 PM
clang-doc/BitcodeReader.cpp
27 ↗(On Diff #139644)

Record[0] holds the size of the array -- in this case, 20 -- in the bitstream representation of the array (see the array section here). So, yes, it copies Record[1]...[20] to Field[0]...[19]. Not sure how that can be rewritten more clearly, but I'll add a clarifying comment!

clang-doc/BitcodeWriter.h
129 ↗(On Diff #139869)

I'm not sure I'm understanding you -- my understanding was that it existed to check that the record size was large enough.

clang-doc/Reducer.cpp
18 ↗(On Diff #139644)

Doesn't llvm::Optional imply that it *could* validly not exist though? This is trying to express here that that isn't a valid state.

clang-doc/Representation.h
135 ↗(On Diff #139644)

It's there (as is the CommentInfo one) because otherwise it was throwing errors. It appears that a copy constructor is called somewhere if this isn't specified.

lebedev.ri added inline comments.Mar 27 2018, 12:17 PM
clang-doc/BitcodeWriter.h
129 ↗(On Diff #139869)

https://reviews.llvm.org/D41102?id=136520#inline-384087

#ifndef NDEBUG // Don't want explicit dtor unless needed
~ClangDocBitcodeWriter() {
  // Check that the static size is large-enough.
  assert(Record.capacity() == BitCodeConstants::RecordSize);
}
#endif

I.e. it checked that it still only had the static size, and did not transform into normal vector with data stored on heap-allocated buffer.

lebedev.ri added inline comments.Mar 29 2018, 3:29 AM
clang-doc/BitcodeWriter.h
129 ↗(On Diff #139869)

Ok, let me use more words this time.
There are several storage container types with contiguous storage:

  • array - fixed-size stack storage, size == capacity == compile-time constant
  • vector - dynamic heap storage, size <= capacity
  • smallvector - fixed-size stack storage, and if the data does not fit, it degrades into the vector.

But there is also one more option:

  • fixedvector - which is a array - has only fixed-size stack storage (capacity == compile-time constant), but with vector-like interface, i.e. size <= capacity, and size can be changed at run-time.

In llvm, i'm not aware of existence of llvm::fixedvector, so that assert was ensuring that the smallvector behaved like the fixedvector - it only had the fixed-size stack storage, and 'did not have' the heap buffer.

This is a kinda ugly hack, but i think the current assert clearly does something different...

Does that make any sense?

clang-doc/Representation.cpp
28 ↗(On Diff #139972)

This was changed, but no test changes followed.
Needs more tests. The test coverage is clearly bad.

Fixing assert on vector size.

OK, I didn't get through all the detail, but we should talk about the MR stuff today :)

clang-doc/BitcodeReader.cpp
25 ↗(On Diff #140275)

why is this true?
i.e. why can't I produce a malformed bitcode file that causes this assertion to fail?
(My usual understanding is that a failed assert is meant to indicate a programming error, if it means something else here like "malformed file" you should probably document that if you're sure it's what you want)

42 ↗(On Diff #140275)

Similarly, what guarantees Record[0] is in-range for AccessSpecifier? casting an out-of-range integer to an enum results in UB.

clang-doc/BitcodeReader.h
11 ↗(On Diff #140275)

nit: *from* LLVM bitcode, or am I missing something?

29 ↗(On Diff #140275)

I found the implementation here a bit hard to understand at first, mostly because the state/invariants/responsibilities of each method weren't obvious to me.
Left some suggestions below to document these a bit more.

However many of these seem to in practice only share the Record buffer. It seems making this an explicit parameter would let you make these free functions, hiding them from the header and avoiding any possibility of spooky side-channel interactions. Up to you of course.

46 ↗(On Diff #140275)

I think this needs some elaboration :-)
e.g. Populates Record with the decoded record, and then calls one of the parseRecord overloads.

In general the data flow between these methods seems a little obscured by having Record be a field rather than a parameter as ID/Blob are.
(Incidentally I don't think you gain anything here from the use of SmallVector over std::vector, which is easier to type, but it doesn't really matter)

49 ↗(On Diff #140275)

comment for these methods e.g. Populates the appropriate field of the Info object based on a decoded record whose content is in Record.

59 ↗(On Diff #140275)

Comment for these like "decodes the blob part of a record" - or just rename these decodeBlob or similar?

clang-doc/Reducer.cpp
17 ↗(On Diff #140275)

A little bit of context to document here: the fact that ToolResults values are bit-encoded InfoSets describing each TU in the codebase. (I think?)

(Also, why not return InfoSet by value?)

23 ↗(On Diff #140275)

I don't see where any actual merging is happening.
Say we process TU 1 which has a decl of foo() with a doc comment, and TU 2 which defines foo(). Now we're going to have something like:
ToolResults = {
"tu1": "...bitcode...foo().../*does foo*/",
"tu2": "...bitcode...foo()...Foo.cpp:53",
}
the first readBitstream call is going to ultimately call StoreData(UniqueInfos->Functions), which appends a new function to the array
then the second readBitstream call will do the same thing, and you have an InfoSet with two copies of foo()

what am I missing?

clang-doc/Reducer.h
21 ↗(On Diff #140275)

This doesn't quite look like the typical reduce pattern. (And there's no framework or guidance for MRs in libtooling so this isn't surprising or maybe even a problem.

When the mapper emits its values, it usually chooses the keys so that a non-trivial merge is only required among values with the same key. (e.g. USR here)
This does two things:

  1. it makes it really easy to parallelize: after the mappers run in parallel, the data is grouped by key and all the reducers can run in parallel
  2. it simplifies the reducer code a lot, as it only has to deal with a single entity (e.g. function) at a time.

So conceptually instead of being a function(vector<InfoSet>) --> InfoSet, it's a function(USR, vector<FunctionInfo>) --> FunctionInfo.
The signature you've got here doesn't actually preclude 1), but it's unclear whether you expect it to be used this way. (i.e. do all the InfoSets being passed here actually contain only a single, identical entity?).

That said - does this actually matter? If you want to scale to large codebases, loading all your Infos into one InfoSet in memory is something to avoid. It's important to be able to process the symbols separately, both for parallelism and to avoid atomic tasks that are too big for one machine.
If you want to do this, then modelling the data unit as "one symbol of some kind" rather than "an arbitrary set of symbols" is much more natural. You no longer need "what is this symbol" in the value and to decode it over and over, instead you put it in the key.
So conceptually you have something like:
key="FUNCTION/::foo(int)", value = "COMMENT=does foo;PARAMS=[length]"
Your reducer merges the values for the same key without having to touch the key, and then the final documentation generator can run over the values inde

If you don't care about big codebases, I can't think of a strong reason to care. I like the conceptual framework of MR, but it's not necessary.

clang-doc/Representation.h
135 ↗(On Diff #139644)

you can define the move constructor as = default, and copy as = delete. The intent is clear, but you save some typing.

207 ↗(On Diff #140275)

This class is central, has a vague name, and is undocumented :-)

I gather this is the collection that describes the whole codebase. Something like CodebaseInfo might be a little more descriptive (though also misleading, as it doesn't inherit Info). But the name is OK as long as it's documented.

220 ↗(On Diff #140275)

So this implementation relies on InfoIndex and having the whole codebase in one InfoSet in memory.
If you want to scale past having everything in-memory, this needs to be part of the MR too.

Since you're at euroLLVM, let's find a chance to chat about it today, we should make a decision about whether "all fits in memory" is a simplifying assumption you want.

juliehockett marked 17 inline comments as done.

Reorganizing and streamlining, particularly in decoupling the reader from the reduce process and redesigning a bit to allow for more flexible reducing. Currently implements an in-memory reducer, but could (theoretically) be extended.

juliehockett marked 5 inline comments as done.Apr 25 2018, 3:35 PM
juliehockett added inline comments.
clang-doc/BitcodeReader.h
59 ↗(On Diff #140275)

In general they decode the record abbrev of the field, which is a single blob in many, but not all, cases. I've changed it to decodeRecord, but is too similar to the above?

clang-doc/BitcodeWriter.h
129 ↗(On Diff #139869)

Ah, yes. I see what you're trying to do, but the issue with that is it assumes that the record must be a certain size (or smaller). However, there is no guarantee of the size of the record (i.e. there may and will be records of variable size getting written), so it's not unreasonable to think that the record may need to be resized to accommodate that.

lebedev.ri added inline comments.Apr 25 2018, 3:38 PM
clang-doc/BitcodeWriter.h
129 ↗(On Diff #139869)

However, there is no guarantee of the size of the record (i.e. there may and will be records of variable size getting written), so it's not unreasonable to think that the record may need to be resized to accommodate that.

Yes, but *currently* that was correct assumption.
All variable-length records (well, blobs) are (were?) emitted directly, without using this intermediate Record container.

sammccall added inline comments.May 2 2018, 1:47 AM
clang-doc/Index.h
28 ↗(On Diff #144011)

What is this interface for? It looks like none of the functions are ever called through the interface.

If the intent is to abstract away a MR framework, I don't think this achieves that - multimachine MR frameworks have their "index" distributed and probably won't have a "reduce" function you can call with these semantics.

Consider just moving the in-memory reduction to the main file (which isn't portable across MR abstractions anyway) and avoiding abstractions there:

StringMap<std::vector<std::unique_ptr<Info>>>>  MapOutput;
// populate MapOutput as you currently do in inMemoryReduceResults
for (const auto &Group : MapOutput)
  Write(Group.first, reduceInfos(std::move(Group.second)));
clang-doc/Reducer.cpp
18 ↗(On Diff #144011)

TYPE is unused?

clang-doc/Representation.cpp
1 ↗(On Diff #144011)

this is important/subtle logic, and there are no comments (particularly about motivating how specific fields are merged)!

15 ↗(On Diff #144011)

why this, rather than actual assignment? (comments!)

20 ↗(On Diff #144011)

why would you ever assign() rather than move()

26 ↗(On Diff #144011)

why this pattern of "assign if empty" for these types?

39 ↗(On Diff #144011)

what if there are duplicatel?

53 ↗(On Diff #144011)

is plain concatenation of comments what you want?

clang-doc/Representation.h
75 ↗(On Diff #144011)

consider std::tie(A, B) == std::tie(Other.A, Other.B)

79 ↗(On Diff #144011)

do you really need !=?
if so, prefer to define it as !(*this == Other) to avoid redundancy.

159 ↗(On Diff #144011)

should this be a copy constructor? (reference should be const)

162 ↗(On Diff #144011)

So having this "overriding" that's not virtual seems like asking for trouble: if you accidentally deref a unique_ptr<Info> and compare it, you'll get the base behavior which should probably never be called directly.

Consider naming this mergeBase and making it protected.

juliehockett marked 17 inline comments as done.

Cleaning up and clarifying the merging process, and addressing comments

juliehockett added inline comments.May 7 2018, 10:37 AM
clang-doc/Representation.cpp
53 ↗(On Diff #144011)

Yes, in this case, as they aren't comment strings but vectors of comment infos.

This will break things in clang-tools-extra without D46615, so I'm going to hold off landing this until that goes through

This will break things in clang-tools-extra without D46615, so I'm going to hold off landing this until that goes through

Oops wrong patch disregard

Ping....any more thoughts?

The reduce logic seems to be in a good shape. Some nits and questions inlined.

clang-doc/Reducer.cpp
19 ↗(On Diff #144948)

It seems that you could use a template function?

24 ↗(On Diff #144948)

Do we really want to give up all infos when one bad info/merge is present?

clang-doc/Representation.h
138 ↗(On Diff #144948)

Shouldn't USR be SymbolID() by default?

146 ↗(On Diff #144948)

It's a bit awkward that users have to dispatch from info types to the corresponding merge function (as in Reducer.cpp). I think it would make users' life easier if the library handles the dispatching.

I wonder if something like the following would be better:

struct Info {
    std::unique_ptr<Info> merge(const Indo& LHS, const Info& RHS);
};
// A standalone function.
std::unique_ptr<Info> mergeInfo(const Info &LHS, const Info& RHS) {
  // merge base info.
  ...
  // merge subclass infos.
  assert(LHS.IT == RHS.IT); // or return nullptr
  switch (LHS.IT) {
   ... 
    return Namespace::merge(LHS, RHS);
  } 
}

struct NamespaceInfo : public Info {
  std::unique_ptr<Info> merge(LHS, RHS);
};

The unhandled switch case warning in compiler would help you catch unimplemented merge when new info types are added.

clang-doc/tool/ClangDocMain.cpp
60 ↗(On Diff #144948)

If this only dumps the intermediate results, should we call it something like dump-intermediate instead, to avoid confusion with final results?

148 ↗(On Diff #144948)

Print an error message on error?

154 ↗(On Diff #144948)

Could you add a brief comment explaining what key and value are?

161 ↗(On Diff #144948)

nit: remove debug logging?

170 ↗(On Diff #144948)

nit: maybe also output the number of entries collected in the map?

181 ↗(On Diff #144948)

(Sorry that I might be missing context here.)

Could you please explain the incentive for dumping each info group to one bc file? If the key identifies a symbol (e.g. USR), wouldn't this result in a bitcode file being created for each symbol? This doesn't seem very scalable.

juliehockett marked 11 inline comments as done.

Reworking the reducer interface a bit to address comments.

juliehockett added inline comments.May 25 2018, 5:36 PM
clang-doc/Representation.h
138 ↗(On Diff #144948)

It actually is garbage by default -- doing this zeroed it out.

146 ↗(On Diff #144948)

Sort of addressed in this update. There's an issue with where we allocate the return pointer, because we need to know the type of info at allocation time -- let me know if what's here now is too far off of what you were suggesting.

clang-doc/tool/ClangDocMain.cpp
181 ↗(On Diff #144948)

Yes, it would. This is mostly for debugging, since there's not really any tools outside the clang system that would actually want/be able to use this information.

ioeric added inline comments.May 28 2018, 10:10 AM
clang-doc/BitcodeReader.cpp
553 ↗(On Diff #148678)

Convert this to a template function?

clang-doc/Reducer.cpp
19 ↗(On Diff #148678)

Now that merging is implemented in the info library, reduceInfos doesn't seem to be necessary.

I'd suggest moving writeInfo closer to the bitcode writer library and get rid of the Reducer.h/cpp.

23 ↗(On Diff #148678)

I think a more canonical form to pass in an output buffer is via llvm::raw_ostream, and then callers can use llvm::raw_string_ostream.

clang-doc/Representation.cpp
30 ↗(On Diff #148678)

Macros are generally discouraged. Could you make this a function in Info e.g. Info::mergeable(const Info &Other). And sub-classes can simply can asssert(mergeable(Other)).

59 ↗(On Diff #148678)

Use llvm::Expected<std::unique_ptr<Info>> (e.g. llvm::make_error<StringError>(...)) for error handling and let callers decide how to handle the error (e.g. llvm::errs() << llvm::toString(Err)).

clang-doc/Representation.h
216 ↗(On Diff #148678)

Please elaborate a bit on the contract e.g. all values should have the same type; otherweise ...

217 ↗(On Diff #148678)

nit: s/Value/Values/ or Infos

146 ↗(On Diff #144948)

Thanks!

I thin what you have is also fine, as long as it's easy to maintain when adding new info types.

clang-doc/tool/ClangDocMain.cpp
181 ↗(On Diff #144948)

Ok, is there any plan to convert intermediate result to final result and output in a more accessible format? If so, please add a FIXME somewhere just to be clearer.

176 ↗(On Diff #148678)

Rather then replace the values in MapOutput, it would probably be clearer if you store the reduced infos into a different container e.g. Reduced.

docs/ReleaseNotes.rst
61 ↗(On Diff #148678)

Are these changes relevant to this patch?

juliehockett marked 13 inline comments as done.

Fixing comments

clang-doc/tool/ClangDocMain.cpp
181 ↗(On Diff #144948)

That's what the next patch set is (to generate YAML).

ioeric accepted this revision.Jun 1 2018, 1:14 AM

lgtm with just a few more nits.

clang-doc/BitcodeWriter.cpp
484 ↗(On Diff #149371)

Nit: EMITINFO is a bit confusing with writeInfo. Are they doing the same thing (emit sounds like write)? You might want to rename either of them. While you are here, it might also make sense to clear up the MACRO :)

clang-doc/Representation.cpp
57 ↗(On Diff #149371)

nit: std::error_code() is usually used to indicate no error. You could use llvm::inconvertibleErrorCode().

clang-doc/tool/ClangDocMain.cpp
181 ↗(On Diff #144948)

Sure. I think a FIXME should be in place if the feature is not already there.

68 ↗(On Diff #149371)

nit: s/current option/default option/

68 ↗(On Diff #149371)
178 ↗(On Diff #149371)

Shouldn't we continue to the next group if reduction goes wrong for the current group?

This revision is now accepted and ready to land.Jun 1 2018, 1:14 AM
juliehockett marked 8 inline comments as done.
This revision was automatically updated to reflect the committed changes.