Page MenuHomePhabricator

Setup clang-doc frontend framework
ClosedPublic

Authored by juliehockett on Dec 11 2017, 5:36 PM.

Details

Summary

Setting up the mapper part of the frontend framework for a clang-doc tool. It creates a series of relevant matchers for declarations, and uses the ToolExecutor to traverse the AST and extract the matching declarations and comments. The mapper serializes the extracted information to individual records for reducing and eventually doc generation.

For a more detailed overview of the tool, see the design document on the mailing list: RFC: clang-doc proposal

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
juliehockett marked 14 inline comments as done.

Fixing comments and adding tests

Thank you for working on this!
Some more nitpicking.

Please consider adding even more tests (ideally, all this code should have 100% test coverage)

clang-doc/BitcodeWriter.cpp
139 ↗(On Diff #136520)

This change is not covered by tests.
(I've actually found out that the hard way, by trying to find why it didn't trigger any asssertions, oh well)

325 ↗(On Diff #136520)

I think it would be cleaner to move it (at least the enterblock, it might make sense to leave the header at the very top) after the static variable

363 ↗(On Diff #136520)

I.e.

...
, FUNCTION_IS_METHOD}}};

  Stream.EnterBlockInfoBlock();
  for (const auto &Block : TheBlocks) {
    assert(Block.second.size() < (1U << BitCodeConstants::SubblockIDSize));
    emitBlockInfo(Block.first, Block.second);
  }
  Stream.ExitBlock();

  emitVersion();
}
clang-doc/BitcodeWriter.h
19 ↗(On Diff #136520)

Please sort includes, clang-tidy complains.

32 ↗(On Diff #136520)
/build/clang-tools-extra/clang-doc/BitcodeWriter.h:32:23: warning: invalid case style for variable 'VERSION_NUMBER' [readability-identifier-naming]
static const unsigned VERSION_NUMBER = 1;
                      ^~~~~~~~~~~~~~
                      VersionNumber
163 ↗(On Diff #136520)

The simplest solution would be

#ifndef NDEBUG // Don't want explicit dtor unless needed
~ClangDocBitcodeWriter() {
  // Check that the static size is large-enough.
  assert(Record.capacity() == BitCodeConstants::RecordSize);
}
#endif
228 ↗(On Diff #136520)

So you want to be really definitive with this. I wanted to avoid that, actually..
Then i'm afraid one more assert is needed, to make sure this is *actually* true.

I'm not seeing any way to make SmallVector completely static,
so you could either add one more wrapper around it (rather ugly),
or check the final size in the ClangDocBitcodeWriter destructor (will not pinpoint when the size has 'overflowed')

246 ↗(On Diff #136520)

Does it *ever* make sense to output BlockInfoBlock anywhere else other than once at the very beginning?
I'd think you should drop the boolean param, and unconditinally call the emitBlockInfoBlock(); from ClangDocBitcodeWriter::ClangDocBitcodeWriter() ctor.

248 ↗(On Diff #136520)

The naming choices confuse me.
There is writeBitstream() and emitBlock(), which is called from writeBitstream() to write the actual contents of the block.

Why one is write and another is emit?
To match the BitstreamWriter naming choices? (which uses Emit prefix)?
To avoid the confusion of which one outputs the actual content, and which one outputs the whole block?

I think it should be:

- void emitBlock(const NamespaceInfo &I);
+ void emitBlockContent(const NamespaceInfo &I);
- void ClangDocBitcodeWriter::writeBitstream(const T &I, bool WriteBlockInfo);
+ void ClangDocBitcodeWriter::emitBlock(const T &I, bool EmitBlockInfo);

This way, *i think* their names would clearner-er state what they do, and won't be weirdly different.
What do you think?

clang-doc/Representation.h
18 ↗(On Diff #136520)

Please sort includes, clang-tidy complains.

clang-doc/Serialize.cpp
88 ↗(On Diff #136520)
/build/clang-tools-extra/clang-doc/Serialize.cpp:88:17: warning: invalid case style for variable 'i' [readability-identifier-naming]
  for (unsigned i = 0, e = C->getNumArgs(); i < e; ++i)
                ^                           ~        ~~
                I                           I        I
/build/clang-tools-extra/clang-doc/Serialize.cpp:88:24: warning: invalid case style for variable 'e' [readability-identifier-naming]
  for (unsigned i = 0, e = C->getNumArgs(); i < e; ++i)
                       ^                        ~~
                       E                        E
107 ↗(On Diff #136520)
/build/clang-tools-extra/clang-doc/Serialize.cpp:107:19: warning: invalid case style for variable 'i' [readability-identifier-naming]
    for (unsigned i = 0, e = C->getDepth(); i < e; ++i)
                  ^                         ~        ~~
                  I                         I        I
/build/clang-tools-extra/clang-doc/Serialize.cpp:107:26: warning: invalid case style for variable 'e' [readability-identifier-naming]
    for (unsigned i = 0, e = C->getDepth(); i < e; ++i)
                         ^                      ~~
                         E                      E
clang-doc/Serialize.h
19 ↗(On Diff #136520)

Please sort includes, clang-tidy complains.

clang-doc/tool/ClangDocMain.cpp
80 ↗(On Diff #136520)

Why at the beginning though?
Couldn't the user pass -extra-arg=-fno-parse-all-comments, which could override this?

juliehockett marked 16 inline comments as done.

Adding tests, fixing comments, and removing an (as-of-yet) unused element of the CommentInfo struct.

clang-doc/BitcodeWriter.cpp
139 ↗(On Diff #136520)

So after a some digging, this particular field can't be tested right now as the mapper doesn't look at any TemplateDecls (something that definitely needs to be implemented, but in a follow-on patch). I've removed it for now, until it can be properly used/tested.

196 ↗(On Diff #135682)

If that makes sense to you, sounds good to me!

clang-doc/BitcodeWriter.h
37 ↗(On Diff #136161)

Figured it out -- the Reference struct didn't have default for the enum, and so if it wasn't initialized it was undefined. Should be fixed now.

test/clang-doc/mapper-enum.cpp
17 ↗(On Diff #136303)

Not currently -- I'm planning to add that functionality in the future, but right now it ignores typedef or using decls.

Could some other people please review this differential, too?
I'm sure i have missed things.


Some more nitpicking.

For this differential as standalone, i'we mostly run out of things to nitpick.
Some things can probably be done better (the blockid/recordid stuff could probably be nicer if tablegen-ed, but that is for later).

I'll try to look at the next differential, and at them combined.

clang-doc/BitcodeWriter.cpp
120 ↗(On Diff #136650)

We don't actually push these strings to the Record (but instead output them directly), so this assertion is not really meaningful, i think?

clang-doc/BitcodeWriter.h
21 ↗(On Diff #136650)

+DenseMap

21 ↗(On Diff #136650)

+StringRef

197 ↗(On Diff #136650)

Humm, you could avoid this constant, and conserve a few bits, if you move the init-list out of emitBlockInfoBlock() to somewhere e.g. after the enum RecordId, and then since the BlockId ID is already passed, you could compute it on-the-fly the same way the BitCodeConstants::SubblockIDSize is asserted in emitBlockInfo*().
Not sure if it's worth doing though. Maybe just add it as a NOTE here.

249 ↗(On Diff #136650)

Stale comment

clang-doc/Representation.h
60 ↗(On Diff #136650)

Info *Ref; isn't used anywhere

117 ↗(On Diff #136650)

llvm::Optional<Location> DefLoc; ?

juliehockett marked 11 inline comments as done.

Addressing comments

lebedev.ri added inline comments.Mar 2 2018, 10:38 AM
clang-doc/Representation.h
117 ↗(On Diff #136791)

I meant that IsDefinition controls whether DefLoc will be set/used or not.
So with llvm::Optional<Location> DefLoc, you don't need the bool IsDefinition.

juliehockett marked an inline comment as done.

Removing IsDefinition field.

clang-doc/Representation.h
117 ↗(On Diff #136791)

That...makes so much sense. Oops. Thank you!

Eugene.Zelenko added inline comments.Mar 5 2018, 6:15 PM
clang-doc/BitcodeWriter.h
160 ↗(On Diff #136809)

Looks like Clang-format was applied incorrectly, because this is Google, not LLVM style. Please note that it doesn't modify file, just output formatted code to terminal.

Please reformat other files, including those in dependent patches.

My apologies for getting back on this so late!

So, as an idea (as this diff implements), I updated the string references to be a struct, which holds the USR of the referenced type (for serialization, both here in the mapper and for the dump option in the reducer, as well as a pointer to an Info struct. This pointer is not used at this point, but would be populated by the reducer. Thoughts?

This seems like quite a decent approach! That being said, I don't see the pointer yet? I assume you mean that you will be adding this? Additionally, a slight disadvantage of doing this generic approach is that you need to do bookkeeping on what it is referencing, but I guess there's no helping that due to the architecture which makes you rely upon the USR? Personally I'd prefer having the explicit types if and where possible. So for now a RecordInfo has a vecotr of Reference's to its parents, but we know the parents can only be of certain kinds (more than just a RecordType, but you get the point); it won't be an enum, namespace or function.

As I mentioned, we did this the other way around, which also has the slight advantage that I only had to create and save the USR once per info instance (as in, 10 references to a class only add the overhead of 10 pointers, rather than each having the USR as well), but our disadvantage was of course that we had delayed serialization (although we could arguably do both simultaneously). It seems each method has its merits :).

This seems like quite a decent approach! That being said, I don't see the pointer yet? I assume you mean that you will be adding this? Additionally, a slight disadvantage of doing this generic approach is that you need to do bookkeeping on what it is referencing, but I guess there's no helping that due to the architecture which makes you rely upon the USR? Personally I'd prefer having the explicit types if and where possible. So for now a RecordInfo has a vecotr of Reference's to its parents, but we know the parents can only be of certain kinds (more than just a RecordType, but you get the point); it won't be an enum, namespace or function.

If you take a look at the follow-on patch to this (D43341), you'll see that that is where the pointer is added in (since it is irrelevant to the mapper portion, as it cannot be filled out until the information has been reduced). The back references to children and whatnot are also added there.

As I mentioned, we did this the other way around, which also has the slight advantage that I only had to create and save the USR once per info instance (as in, 10 references to a class only add the overhead of 10 pointers, rather than each having the USR as well), but our disadvantage was of course that we had delayed serialization (although we could arguably do both simultaneously). It seems each method has its merits :).

The USRs are kept for serialization purposes -- given the modular nature of the design, the goal is to be able to write out the bitstream and have it be consumable with all necessary information. Since we can't write out pointers (and it would be useless if we did, since they would change as soon as the file was read in), we maintain the USRs to have a means of re-finding the referenced declaration.

That said, I was looking at the Clangd symbol indexing code yesterday, and noticed that they're hashing the USRs (since they get a little lengthy, particularly when you have nested and/or overloaded functions). I'm going to take a look at that today to try to make the USRs more space-efficient here.

Adding hashing to reduce the size of USRs and updating tests.

Nice!
Some further notes based on the SHA1 nature.

clang-doc/BitcodeWriter.cpp
74 ↗(On Diff #137244)

Those are mixed up.
USRLengthSize is definitively supposed to be second.

81 ↗(On Diff #137244)

The sha1 is all-printable, so how about using BitCodeAbbrevOp::Encoding::Char6 ?
Char4 would work best, but it is not there.

149 ↗(On Diff #137244)

Ha, and all the *_USR are actually StringAbbrev's, not confusing at all :)

309 ↗(On Diff #137244)

Now it would make sense to also assert that this sha1(usr).strlen() == 20

clang-doc/BitcodeWriter.h
46 ↗(On Diff #137244)

Can definitively lower this to 5U (2^6 == 32, which is more than the 20 8-bit chars of sha1)

clang-doc/Representation.h
59 ↗(On Diff #137244)

Now that USR is sha1'd, this is always 20 8-bit characters long.

107 ↗(On Diff #137244)

20
Maybe place using USRString = SmallString<20>; // SHA1 of USR somewhere and use it everywhere?

Athosvk added a comment.EditedMar 7 2018, 3:03 AM

If you take a look at the follow-on patch to this (D43341), you'll see that that is where the pointer is added in (since it is irrelevant to the mapper portion, as it cannot be filled out until the information has been reduced). The back references to children and whatnot are also added there.

Oops! I'll have a look!

The USRs are kept for serialization purposes -- given the modular nature of the design, the goal is to be able to write out the bitstream and have it be consumable with all necessary information. Since we can't write out pointers (and it would be useless if we did, since they would change as soon as the file was read in), we maintain the USRs to have a means of re-finding the referenced declaration.

What I was referring to was the storing of a USR per reference. Of course, serializing pointers wouldn't work, but what I mean is that what we used as a USR was stored in what was pointed to, not in the reference that tells what we are pointing to. To be a little more concise, a RecordInfo has pointers to the FuntionInfo for its member functions. Upon serialization, the RecordInfo queries the USR of those functions. A function being referenced multiple times remains to only have the USR stored. If I understand correctly, you currently save the USR for time an InfoType references another InfoType.

Anyhow, don't pay too much attention to that comment, it's all meant as a minor thing. It sure is looking good so far!

Some further notes based on the SHA1 nature.

I'm sorry, brainfreeze, i meant 40 chars, not 20.
Updated comments...

clang-doc/BitcodeWriter.cpp
309 ↗(On Diff #137244)

40 that is

clang-doc/BitcodeWriter.h
46 ↗(On Diff #137244)

Edit: to 6U (2^6 == 64, which is more than the 40 8-bit chars of sha1)

clang-doc/Representation.h
59 ↗(On Diff #137244)

40 that is

107 ↗(On Diff #137244)

40

juliehockett marked 13 inline comments as done.

Updating bitcode writer for hashed USRs, and re-running clang-format. Also cleaning up a couple of unused fields.

Hmm, i'm missing something about the way store sha1...

clang-doc/BitcodeWriter.cpp
53 ↗(On Diff #137457)

This is VBR because USRLengthSize is of such strange size, to conserve the bits?

57 ↗(On Diff #137457)

Looking at the NumWords changes (decrease!) in the tests, and this is bugging me.
And now that i have realized what we do with USR:

  • we first compute SHA1, and get 20x uint8_t
  • store/use it internally
  • then hex-ify it, getting 40x char (assuming 8-bit char)
  • then convert to char6, winning back two bits. but we still loose 2 bits.

Question: *why* do we store sha1 of USR as a string?
Why can't we just store that USRString (aka USRSha1 binary) directly?
That would be just 20 bytes, you just couldn't go any lower than that.

clang-doc/Representation.h
29 ↗(On Diff #137457)

Right, of course, internally this is kept in the binary format, which is just 20 chars.
This is not the string (the hex-ified version of sha1), but the raw sha1, the binary.
This should somehow convey that. This should be something closer to USRSha1.

sammccall accepted this revision.Mar 8 2018, 4:51 PM

There's a few places where we can trim some of the boilerplate, which I think is important - it's hard to find the "real code" among all the plumbing in places.
Other than that, this seems OK to me.

clang-doc/BitcodeWriter.h
116 ↗(On Diff #137457)

I think you don't want to declare ID in the unspecialized template, so you get a compile error if you try to use it.

(Using traits for this sort of thing seems a bit overboard to me, but YMMV)

154 ↗(On Diff #137457)

Hmm, you spend a lot of effort plumbing this variable around! Why is it so important?
Filesize? (I'm not that familiar with LLVM bitcode, but surely we'll end up with a string table anyway?)

If it really is an important option people will want, the command-line arg should probably say why.

241 ↗(On Diff #137457)

OK, I don't get this at all.

We have to declare emitBlockContent(NamespaceInfo) *and* the specialization of MapFromInfoToBlockId<NamespaceInfo>, and deal with the public interface emitBlock being a template function where you can't tell what's legal to pass, instead of writing:

void emitBlock(const NamespaceInfo &I) {
  SubStreamBlockGuard Block(Stream, BI_NAMESPACE_BLOCK_ID); // <-- this one line
  ...
}

This really seems like templates for the sake of templates :(

clang-doc/ClangDoc.h
10 ↗(On Diff #137457)

This comment doesn't seem accurate - there's no main() in this file.
There's a FrontendActionFactory, but nothing in this file uses it.

37 ↗(On Diff #137457)

nit: seems odd to put all this implementation in the header.
(personally I'd just expose a function returning unique_ptr<FrontendActionFactory> from the header, but up to you...)

38 ↗(On Diff #137457)

for ASTConsumers implemented by ASTVisitors, there seems a fairly strong convention to just make the same class extend both (MapASTVisitor, here).
That would eliminate one plumbing class...

clang-doc/Mapper.cpp
33 ↗(On Diff #137457)

It seems a bit of a poor fit to use a complete bitcode file (header, version, block info) as your value format when you know the format, and know there'll be no version skew.
Is it easy just to emit the block we care about?

clang-doc/Representation.h
29 ↗(On Diff #137457)

I'm not sure that any of the implementation (either USR or SHA) belongs in the type name.
In clangd we called this type SymbolID, which seems like a reasonable name here too.

44 ↗(On Diff #137457)

this is probably the right place to document these fields - what are the legal kinds? what's the name of a comment, direction, etc?

This revision is now accepted and ready to land.Mar 8 2018, 4:51 PM
This revision was automatically updated to reflect the committed changes.
juliehockett marked 11 inline comments as done.

Might have been better to not start landing until the all differentials are understood/accepted, but i understand that it is not really up to me to decide.
Let's hope nothing in the next differentials will require changes to this initial code :)

clang-doc/BitcodeWriter.h
241 ↗(On Diff #137457)

If you want to add a new block, in one case you just need to add one

template <> struct MapFromInfoToBlockId<???Info> {
  static const BlockId ID = BI_???_BLOCK_ID;
};

In the other case you need to add whole

void ClangDocBitcodeWriter::emitBlock(const ???Info &I) {
  StreamSubBlockGuard Block(Stream, BI_???_BLOCK_ID);
  emitBlockContent(I);
}

(and it was even longer initially)
It seems just templating one static variable is shorter than duplicating emitBlock() each time, no?

Do compare the current diff with the original diff state.
I *think* these templates helped move much of the duplication to simplify the code overall.

Since the commit was reverted, did you mean to either recommit it, or reopen this (with updated diff), so it does not get lost?

Since the commit was reverted, did you mean to either recommit it, or reopen this (with updated diff), so it does not get lost?

Relanded in r327295.

clang-doc/BitcodeWriter.h
154 ↗(On Diff #137457)

It was for testing purposes (so that the tests aren't flaky on filenames), but I replaced it with regex.

241 ↗(On Diff #137457)

You'd still have to add the appropriate emitBlock() function for any new block, since it would have different attributes.

clang-doc/Mapper.cpp
33 ↗(On Diff #137457)

Ideally, yes, but right now in the clang BitstreamWriter there's no way to tell the instance what all the abbreviations are without also emitting the blockinfo to the output stream, though I'm thinking about taking a stab at separating the two.

Also, this relies on the llvm-bcanalyzer for testing, which requires both the header and the blockinfo in order to read the data :/

lebedev.ri added inline comments.Mar 14 2018, 1:44 PM
clang-doc/BitcodeWriter.cpp
230 ↗(On Diff #136303)

And https://github.com/mattgodbolt/compiler-explorer/issues/841 is done,
so now we can see that SmallVector::append() at least results in less code:
https://godbolt.org/g/xJQ59c

So what part is failing, specifically?
The SHA1 blobs of USR's differ in the llvm-bcanalyzer dumps?
The actual filenames %t/docs/bc/<sha1-to-text> differ?
I guess both?

First one you should be able to handle by replacing the actual values with a regex
(i'd guess <USR abbrevid=4 op0=20 op1=11 <...> op19=226 op20=232/> -> <USR abbrevid=4 .*/>, but did not try)
I'm not sure we care about the actual values here, do we?

Second one is interesting.
If we assume that the order in which those are generated is the same, which i think is a safer assumption,
then you could just use result id, not key (sha1-to-text of USR), i.e. %t/docs/bc/00.bc, %t/docs/bc/01.bc and so on.
I.e. something like:

  if (DumpMapperResult) {
+   unsigned id = 0;
    Exec->get()->getToolResults()->forEachResult([&](StringRef Key,
                                                     StringRef Value) {
      SmallString<128> IRRootPath;
      llvm::sys::path::native(OutDirectory, IRRootPath);
      llvm::sys::path::append(IRRootPath, "bc");
      std::error_code DirectoryStatus =
          llvm::sys::fs::create_directories(IRRootPath);
      if (DirectoryStatus != OK) {
        llvm::errs() << "Unable to create documentation directories.\n";
        return;
      }
-     llvm::sys::path::append(IRRootPath, Key + ".bc");
+     llvm::sys::path::append(IRRootPath, std::to_string(id) + ".bc");
      std::error_code OutErrorInfo;
      llvm::raw_fd_ostream OS(IRRootPath, OutErrorInfo, llvm::sys::fs::F_None);
      if (OutErrorInfo != OK) {
        llvm::errs() << "Error opening documentation file.\n";
        return;
      }
      OS << Value;
      OS.close();
+     id++;
    });
  }

Hm, or possibly you could just pass the triple to clang?

I was just thinking of disabling the one test that has an issue (class-in-function) on Windows -- the filename is only used in generating *some* USRs, so all of the other ones are fine. We ran into some issues with that though, since UNSUPPORTED: system-windows didn't seem to disable the test on the machine I have access to. Thoughts?

I was just thinking of disabling the one test that has an issue (class-in-function) on Windows -- the filename is only used in generating *some* USRs, so all of the other ones are fine. We ran into some issues with that though, since UNSUPPORTED: system-windows didn't seem to disable the test on the machine I have access to. Thoughts?

UNSUPPORTED: system-windows

Perhaps that is only for msvc?

Have you tried something more broad, like
UNSUPPORTED: mingw32,win32
?

Have you tried something more broad, like
UNSUPPORTED: mingw32,win32
?

That wasn't working either, confusingly, at least on the local windows machine I have.

Huh, something weird is going on there.
What about the other way around, REQUIRES: linux ?

After much digging, it looks like the lit config is never initialized in clang-tools-extra like it is in the other projects. REQUIRES et.al. work properly once that's in there (see D44708). Once that lands I'll reland this and *hopefully* that'll be that!