This is an archive of the discontinued LLVM Phabricator instance.

Setup clang-doc frontend framework
ClosedPublic

Authored by juliehockett on Dec 11 2017, 5:36 PM.

Details

Summary

Setting up the mapper part of the frontend framework for a clang-doc tool. It creates a series of relevant matchers for declarations, and uses the ToolExecutor to traverse the AST and extract the matching declarations and comments. The mapper serializes the extracted information to individual records for reducing and eventually doc generation.

For a more detailed overview of the tool, see the design document on the mailing list: RFC: clang-doc proposal

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Refactoring bitcode writer

Next, i suggest to look into code self-debugging, see comments.
Also, i have added a few questions, it would be great to know that my understanding is correct?

I'm sorry that it seems like we are going over and over and over over the same code again,
this is the very base of the tool, i think it is important to get it as close to great as possible.
I *think* these review comments move it in that direction, not in the opposite direction?

clang-doc/BitcodeWriter.cpp
47 ↗(On Diff #135559)

So in other words this is making an assumption that no file with more than 65535 lines will be analyzed, correct?
Can you add that as comment please?

56 ↗(On Diff #135559)
AbbrevDsc Abbrev = nullptr;
57 ↗(On Diff #135559)
// Is this 'description' valid?
operator bool() const {
  return Abbrev != nullptr && Name.data() != nullptr && !Name.empty();
}
137 ↗(On Diff #135559)

So FUNCTION_MANGLED_NAME is phased out, and is thus missing, as far as i understand?

148 ↗(On Diff #135559)

+assert(RecordIdNameMap[ID] && "Unknown Abbreviation");

153 ↗(On Diff #135559)

+assert(RecordIdNameMap[ID] && "Unknown Abbreviation");

158 ↗(On Diff #135559)

Called only once, and that call does nothing.
I'd drop it.

175 ↗(On Diff #135559)
/// \brief Emits a block ID and the block name to the BLOCKINFO block.
void ClangDocBitcodeWriter::emitBlockID(BlockId ID) {
  const auto& BlockIdName = BlockIdNameMap[ID];
  assert(BlockIdName.data() && BlockIdName.size() && "Unknown BlockId!");

  Record.clear();
  Record.push_back(ID);
  Stream.EmitRecord(llvm::bitc::BLOCKINFO_CODE_SETBID, Record);

  Record.clear();
  for (const char C : BlockIdName) Record.push_back(C);
  Stream.EmitRecord(llvm::bitc::BLOCKINFO_CODE_BLOCKNAME, Record);
}
187 ↗(On Diff #135559)
/// \brief Emits a record name to the BLOCKINFO block.
void ClangDocBitcodeWriter::emitRecordID(RecordId ID) {
  assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
  prepRecordData(ID);

(Yes, prepRecordData() will have the same code. It should get optimized away.)

194 ↗(On Diff #135559)
void ClangDocBitcodeWriter::emitAbbrev(RecordId ID, BlockId Block) {
  assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
  auto Abbrev = std::make_shared<BitCodeAbbrev>();
204 ↗(On Diff #135559)

So remember that in a previous iteration, seemingly useless AbbrevDsc stuff was added to the RecordIdNameMap?
It is going to pay-off now:

void ClangDocBitcodeWriter::emitRecord(StringRef Str, RecordId ID) {
  assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
  assert(RecordIdNameMap[ID].Abbrev == &StringAbbrev && "Abbrev type mismatch");
  if (!prepRecordData(ID, !Str.empty())) return;
...

And if we did not add an RecordIdNameMap entry for this RecordId, then i believe that will also be detected because Abbrev will be a nullptr.

205 ↗(On Diff #135559)
assert(Str.size() < (1U << BitCodeConstants::StringLengthSize));
Record.push_back(Str.size());
210 ↗(On Diff #135559)
void ClangDocBitcodeWriter::emitRecord(const Location &Loc, RecordId ID) {
  assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
  assert(RecordIdNameMap[ID].Abbrev == &LocationAbbrev && "Abbrev type mismatch");
  if (!prepRecordData(ID, !OmitFilenames)) return;
...
211 ↗(On Diff #135559)

Call me paranoid, but:

assert(Loc.LineNumber < (1U << BitCodeConstants::LineNumberSize));
Record.push_back(Loc.LineNumber);
assert(Loc.Filename.size()) < (1U << BitCodeConstants::StringLengthSize));
Record.push_back(Loc.Filename.size());
217 ↗(On Diff #135559)
void ClangDocBitcodeWriter::emitRecord(int Val, RecordId ID) {
  assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
  assert(RecordIdNameMap[ID].Abbrev == &IntAbbrev && "Abbrev type mismatch");
  if (!prepRecordData(ID, Val)) return;
218 ↗(On Diff #135559)
assert(Val < (1U << BitCodeConstants::IntSize));
Record.push_back(Val);
222 ↗(On Diff #135559)
bool ClangDocBitcodeWriter::prepRecordData(RecordId ID, bool ShouldEmit) {
  assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
  if (!ShouldEmit) return false;
232 ↗(On Diff #135559)

Since ClangDocBitcodeWriter is not re-used, but re-constructed* each time, Abbrevs.clear(); does nothing.

  • Hmm, i wonder if that will be a bad thing. Benchmarking will tell i guess :/
236 ↗(On Diff #135559)

https://godbolt.org/g/rD6BWK also suggests it should be static const

276 ↗(On Diff #135559)

Uhm, do you plan on calling emitBlockInfo() from anywhere else other than emitBlockInfoBlock()?
Since it takes const std::vector<RecordId> instead of a const std::initializer_list<RecordId>&, a memory copy will happen...
https://godbolt.org/g/rD6BWK

clang-doc/BitcodeWriter.h
35 ↗(On Diff #135559)

LineNumFixedSize is used for a different things. Given such a specific name, i think it may be confusing?
Also, looking at http://llvm.org/doxygen/classllvm_1_1BitstreamWriter.html#ae6a40b4a5ea89bb8b5076c26e0d0b638 i guess these all should be unsigned.

I think this would be better, albeit more verbose:

struct BitCodeConstants {
  static constexpr unsigned SignatureBitSize = 8U;
  static constexpr unsigned SubblockIDSize = 5U;
  static constexpr unsigned IntSize = 16U;
  static constexpr unsigned StringLengthSize = 16U;
  static constexpr unsigned LineNumberSize = 16U;
};
53 ↗(On Diff #135559)

So what *exactly* does BitCodeConstants::SubblockIDSize mean?

static_assert(BI_LAST < (1U << BitCodeConstants::SubblockIDSize), "Too many block id's!");

?

94 ↗(On Diff #135559)

So i have a question: if something (FUNCTION_MANGLED_NAME in this case) is phased out, does it have to stay in this enum?
That will introduce holes in RecordIdNameMap.
Are the actual numerical id's of enumerators stored in the bitcode, or the string (abbrev, RecordIdNameMap[].Name)?

Looking at tests, i guess these enums are internal detail, and they can be changed freely, including removing enumerators.
Am i wrong?

I think that should be explained in a comment before this enum.

100 ↗(On Diff #135559)

If AbbreviationMap comment makes sense, i guess that common code should be moved here, i.e.

static constexpr unsigned RecordIdCount = RI_LAST - RI_FIRST + 1;

and use this new variable in those two places.

163 ↗(On Diff #135559)

We know we will have at most RI_LAST - RI_FIRST + 1 abbreviations.
Right now that results in just ~40 abbreviations.
Would it make sense to

AbbreviationMap() : Abbrevs(RI_LAST - RI_FIRST + 1) {}

?
(or llvm::DenseMap<unsigned, unsigned> Abbrevs = llvm::DenseMap<unsigned, unsigned>(RI_LAST - RI_FIRST + 1); but that looks uglier to me..)

The change to USR seems like quite an improvement already! That being said, I do think that it might be preferable to opt out of the use of strings for linking things together. What we did with our clang-doc is that we directly used pointers to refer to other types. So for example, our class for storing Record/CXX related information has something like:

std::vector<Function*>	mMethods;
std::vector<Variable*>	mVariables;
std::vector<Enum*>	mEnums;
std::vector<Typedef*>	mTypedefs;

Only upon serialization we fetch some kind of USR that would uniquely identify the type. This is especially useful to us for the conversion to HTML and I think the same would go for this backend, as it seems this way you'll have to do string lookups to get to the actual types, which would be inefficient in multiple aspects. It can make the backend a little more of a one-on-one conversion, e.g. with one of our HTML template definitions (note: this is a Jinja2 template in Python):

{%- for enum in inEntry.GetMemberEnums() -%}
	<tr class="separator">
		<td class="memSeparator" colspan="3"></td>
	</tr>
	<tr class="memitem:EAllocatorStrategy">
		<td class="memItemLeft" align="right">{{- Modifiers.RenderAccessModifier(enum.GetAccessModifier()) -}}</td>
		<td class="memItemMiddle" align="left">enum <a href="{{ enum.GetID() }}.html">{{- enum.GetName().GetName()|e -}}</a></td>
		<td class="memItemRight" valign="bottom">{{- Descriptions.RenderDescription(enum.GetBriefDescription()) -}}</td>
	</tr>
{%- endfor -%}

Disadvantage is of course that you add complexity to certain parts of the deserialization (/serialization) for nested types and inheritance, by either having to do so in the correct order or having to defer the process of initializing these pointers. But see this as just as some thought sharing. I do think this would improve the interaction in the backend (assuming you use the same representation as currently in the frontend). Also, we didn't apply this to our Type representation (which we use to store the type of a member, parameter etc.), which stores the name of the type rather than a pointer to it (since it can also be a built-in), though it embeds pretty much every possible modifier on said type, like this:

EntryName			mName;									
bool				mIsConst = false;						
EReferenceType			mReferenceType = EReferenceType::None;	
std::vector<bool>		mPointerConstnessMask;					
std::vector<std::string>	mArraySizes;							
bool				mIsAtomic = false;						
std::vector<Attribute>		mAttributes;							
bool				mIsExpansion = false;					
std::vector<TemplateArgument>	mTemplateArguments;						
std::unique_ptr<FunctionTypeProperties>     mFunctionTypeProperties = nullptr;		
EntryName			mParentCXXEntry;

The last member refers to the case where a pointer is a pointer to member, though some other fields may require some explaining too. Anyway, this is just to give some insight into how we structured our representation, where we largely omitted string representations where possible.

Have you actually started work already on some backend? Developing backend and frontend in tandem can provide some additional insights as to how things should be structured, especially representation-wise!

clang-doc/Representation.h
113 ↗(On Diff #135559)

How come these are actually unique ptrs? They can be stored directly in the vector, right? (same for CommentInfo children, FnctionInfo params etc.)

Please run Clang-format and Clang-tidy modernize.

clang-doc/Representation.h
80 ↗(On Diff #135559)

Please separate constructors from data members with empty line.

juliehockett marked 29 inline comments as done.
  1. Continued refactoring the bitcode writer
  2. Added a USR attribute to infos
  3. Created a Reference struct to replace the string references to other infos

Disadvantage is of course that you add complexity to certain parts of the deserialization (/serialization) for nested types and inheritance, by either having to do so in the correct order or having to defer the process of initializing these pointers. But see this as just as some thought sharing. I do think this would improve the interaction in the backend (assuming you use the same representation as currently in the frontend).

I agree that the pointer approach would be much more efficient on the backend, but the issue here is that the mapper has no idea where the representation of anything other than the decl it's currently looking at will be, since it sees each decl and serializes it immediately. The reducer, on the other hand, will be able to see everything, and so such pointers could be added as a pass over the final reduced data structure.
So, as an idea (as this diff implements), I updated the string references to be a struct, which holds the USR of the referenced type (for serialization, both here in the mapper and for the dump option in the reducer, as well as a pointer to an Info struct. This pointer is not used at this point, but would be populated by the reducer. Thoughts?

Have you actually started work already on some backend? Developing backend and frontend in tandem can provide some additional insights as to how things should be structured, especially representation-wise!

I added you as a subscriber on the follow-up patches (the reducer, YAML/MD formats) -- would love to hear your thoughts! As of now, the MD output is very rough, but I'm hoping to keep moving forward on that in the next few days.

clang-doc/BitcodeWriter.h
53 ↗(On Diff #135559)

It's the current abbrev id width for the block (described here), so it's the max id width for the block's abbrevs.

94 ↗(On Diff #135559)

Yes, the enum is an implementation detail (FUNCTION_MANGLED_NAME should have been removed earlier). I'll put the comment describing how it works!

Fixing CMakeLists formatting

Could you please add a bit more tests? In particular, i'd like to see how blocks-in-blocks work.
I.e. class-in-class, class-in-function, ...

Is there some (internal to BitstreamWriter) logic that would 'assert()' if trying to output some recordid
which is, according to the BLOCKINFO_BLOCK, should not be there?
E.g. outputting VERSION in BI_COMMENT_BLOCK_ID?

clang-doc/BitcodeWriter.cpp
30 ↗(On Diff #135682)

Ok, these three functions still look off, how about this?

// Yes, not by reference, https://godbolt.org/g/T52Vcj
static void AbbrevGen(std::shared_ptr<llvm::BitCodeAbbrev> &Abbrev,
                      const std::initializer_list<llvm::BitCodeAbbrevOp> Ops) {
  for(const auto &Op : Ops)
    Abbrev->Add(Op);
}

static void IntAbbrev(std::shared_ptr<llvm::BitCodeAbbrev> &Abbrev) {
  AbbrevGen(Abbrev, {
    // 0. Fixed-size integer
    {llvm::BitCodeAbbrevOp::Fixed, BitCodeConstants::IntSize}});
}

static void StringAbbrev(std::shared_ptr<llvm::BitCodeAbbrev> &Abbrev) {
  AbbrevGen(Abbrev, {
    // 0. Fixed-size integer (length of the following string)
    {llvm::BitCodeAbbrevOp::Fixed, BitCodeConstants::StringLengthSize},
    // 1. The string blob
    {llvm::BitCodeAbbrevOp::Blob}});
}

// Assumes that the file will not have more than 65535 lines.
static void LocationAbbrev(std::shared_ptr<llvm::BitCodeAbbrev> &Abbrev) {
  AbbrevGen(Abbrev, {
    // 0. Fixed-size integer (line number)
    {llvm::BitCodeAbbrevOp::Fixed, BitCodeConstants::LineNumberSize},
    // 1. Fixed-size integer (length of the following string (filename))
    {llvm::BitCodeAbbrevOp::Fixed, BitCodeConstants::StringLengthSize},
    // 2. the string blob
    {llvm::BitCodeAbbrevOp::Blob}});
}

Though i bet clang-format will mess-up the formatting again :/

108 ↗(On Diff #135682)

Some of these IntAbbrev's are actually bools.
Would it make sense to already think about being bitcode-size-conservative and introduce BoolAbbrev from the get go?

static void BoolAbbrev(std::shared_ptr<llvm::BitCodeAbbrev> &Abbrev) {
  AbbrevGen(Abbrev, {
    // 0. Fixed-size boolean
    {llvm::BitCodeAbbrevOp::Fixed, BitCodeConstants::BoolSize}});
}

where BitCodeConstants::BoolSize = 1U
?
Or is there some internal padding that would make that pointless?

156 ↗(On Diff #135682)

Uh, oh, i'm sorry, all(?) these "Unknown Abbreviation" are likely copypaste gone wrong.
I'm not sure why i wrote that comment. "Unknown RecordId" might make more sense?

240 ↗(On Diff #135682)

Ok, now that i think about it, it can't be that easy.
Maybe

FIXME: assumes 8 bits per byte
assert(llvm::APInt(8U*sizeof(Val), Val, /*isSigned=*/true).getBitWidth() <= BitCodeConstants::IntSize));

Not sure whether getBitWidth() is really the right function to ask though.
(Not sure how this all works for negative numbers)

clang-doc/BitcodeWriter.h
172 ↗(On Diff #135682)

Newline after constructor

216 ↗(On Diff #135682)

// Emission of appropriate abbreviation type

53 ↗(On Diff #135559)

So in other words that static_assert() is doing the right thing?
Add it after the enum BlockId{} then please, will both document things, and ensure that things remain in a sane state.

Thank you for working on this!
Some more thoughts.

clang-doc/BitcodeWriter.cpp
191 ↗(On Diff #135682)

Why do we have this indirection?
Is there a need to first to (unefficiently?) copy to Record, and then emit from there?
Wouldn't this work just as well?

Record.clear();
Stream.EmitRecord(llvm::bitc::BLOCKINFO_CODE_BLOCKNAME, BlockIdNameMap[ID]);
196 ↗(On Diff #135682)

Hmm, so i've been staring at this and http://llvm.org/doxygen/classllvm_1_1BitstreamWriter.html and i must say i'm not fond of this indirection.

What i don't understand is, in previous function, we don't store BlockId, why do we want to store RecordId?
Aren't they both unstable, and are implementation detail?
Do we want to store it (RecordId)? If yes, please explain it as a new comment in code.

If no, i guess this would work too?

assert(RecordIdNameMap[ID] && "Unknown Abbreviation");
Record.clear();
Stream.EmitRecord(llvm::bitc::BLOCKINFO_CODE_SETRECORDNAME, RecordIdNameMap[ID].Name);

And after that you can lower the default size of SmallVector<> Record down to, hm, 4?

clang-doc/BitcodeWriter.h
161 ↗(On Diff #135682)

This alias is used exactly once, for Record member variable in this class.
Is there any point in having this alias?

161 ↗(On Diff #135682)

Also, why is uint64_t used?
We either push char, or enum, or int. Do we ever need 64-bit?

clang-doc/ClangDoc.h
48

Please add space before {}, and drop unneeded ;

clang-doc/Mapper.h
56 ↗(On Diff #135682)

ClangDocMapper class is staring to look like a god-class.
I would recommend:

  1. Rename ClangDocMapper to ClangDocASTVisitor. It's kind-of conventional to name RecursiveASTVisitor-based classes like that.
  2. Move ClangDocCommentVisitor out of the ClangDocMapper, into namespace {} in clang-doc/Mapper.cpp
    • Split ClangDocSerializer into new .h/.cpp
    • Replace ClangDocSerializer Serializer; with ClangDocSerializer& Serializer;
    • Instantiate ClangDocSerializer (in MapperActionFactory, i think?) before ClangDocMapper
    • Pass ClangDocSerializer& into ClangDocMapper ctor.
juliehockett marked 10 inline comments as done.
  1. Moved the serialization logic out of the Mapper class and into its own namespace
  2. Updated tests
  3. Addressing comments

Is there some (internal to BitstreamWriter) logic that would 'assert()' if trying to output some recordid
which is, according to the BLOCKINFO_BLOCK, should not be there?
E.g. outputting VERSION in BI_COMMENT_BLOCK_ID?

Yes -- it will fail an assertion:
Assertion 'V == Op.getLiteralValue() && "Invalid abbrev for record!"' failed.

clang-doc/BitcodeWriter.cpp
191 ↗(On Diff #135682)

No, since BlockIdNameMap[ID] returns a StringRef, which can be manipulated into an std::string or a const char*, but the Stream wants an unsigned char. So, the copying is to satisfy that. Unless there's a better way to convert a StringRef into an array of unsigned char?

196 ↗(On Diff #135682)

I'm not entirely certain what you mean -- in emitBlockId(), we are storing both the block id and the block name in separate records (BLOCKINFO_CODE_SETBID, BLOCKINFO_CODE_BLOCKNAME, respectively). In emitRecordId(), we're doing something slightly different, in that we emit one record with both the record id and the record name (in record BLOCKINFO_CODE_SETRECORDNAME).

Replacing the copy loop here has the same issue as above, namely that there isn't an easy way to convert between a StringRef and an array of unsigned char.

240 ↗(On Diff #135682)

That assertion fails :/ I could do something like static_cast<int64_t>(Val) == Val but that would require a) IntSize being a power of 2 b) updating the assert anytime IntSize is updated, and 3) still throws a warning about comparing a signed to an unsigned int...

clang-doc/BitcodeWriter.h
53 ↗(On Diff #135559)

No...it's the (max) number of the abbrevs relevant to the block itself, which is to say some subset of the RecordIds for any given block (e.g. for a BI_COMMENT_BLOCK, the number of abbrevs would be 12 and so on the abbrev width would be 4).

To assert for it we could put block start/end markers on the RecordIds and then use that to calculate the bitwidth, if you think the assertion should be there.

Tried fixing tooling::FrontendActionFactory::create() in D43779/D43780, but had to revert due to gcc4.8 issues :/

Thank you for working on this, some more review notes.

Is there some (internal to BitstreamWriter) logic that would 'assert()' if trying to output some recordid
which is, according to the BLOCKINFO_BLOCK, should not be there?
E.g. outputting VERSION in BI_COMMENT_BLOCK_ID?

Yes -- it will fail an assertion:
Assertion 'V == Op.getLiteralValue() && "Invalid abbrev for record!"' failed.

Ok, great.
And it will also complain if you try to output a block within block?

clang-doc/BitcodeWriter.cpp
191 ↗(On Diff #135682)

Aha, i see, did not think of that.
But there is a bytes() function in StringRef, which returns iterator_range<const unsigned char *>.
Would it help? http://llvm.org/doxygen/classllvm_1_1StringRef.html#a5e8f22c3553e341404b445430a3b075b

240 ↗(On Diff #135682)

I see. Let's not have this assertion for now, just a FIXME.

184 ↗(On Diff #136010)

That comment seems wrong.
If the namespace is indeed supposed to be closed, it should happen after the lambda is called, i.e.

          assert(RecordIdNameMap.size() == RecordIdCount);
          return RecordIdNameMap;
        }();
}  // namespace doc

// AbbreviationMap
265 ↗(On Diff #136010)

I think it is as simple as

assert(Loc.LineNumber < (1U << BitCodeConstants::LineNumberSize));

?

367 ↗(On Diff #136010)

So i guess this should be:

void ClangDocBitcodeWriter::emitBlockInfo(
    BlockId BID, const std::initializer_list<RecordId> &RIDs) {
  assert(RIDs.size() < (1U << BitCodeConstants::SubblockIDSize), "Too many records in a block!");
  emitBlockID(BID);
...

?

clang-doc/BitcodeWriter.h
53 ↗(On Diff #135559)

Aha, i see, so that should go into ClangDocBitcodeWriter::emitBlockInfoBlock(), since that already has that info.
(On a related node, it feels like this all should be somehow tablegen-generated, but that is for some later, post-commit cleanup.)

juliehockett marked 15 inline comments as done.

Fixing comments

Ok, great.
And it will also complain if you try to output a block within block?

Um...no. Since you can have subblocks within blocks.

clang-doc/BitcodeWriter.cpp
191 ↗(On Diff #135682)

Replaced it with an ArrayRef to the bytes_begin() and bytes_end(), but that only works for the block id, not the record id, since emitRecordId() also has to emit the ID number in addition to the name in the same record.

265 ↗(On Diff #136010)

LineNumber is a signed int, so the compiler complains that we're comparing signed and unsigned ints.

lebedev.ri added inline comments.Feb 28 2018, 7:23 AM
clang-doc/BitcodeWriter.h
37 ↗(On Diff #136161)

Hmm, you build with asserts enabled, right?
I tried testing this, and three tests fail with

clang-doc: /build/llvm/include/llvm/Bitcode/BitstreamWriter.h:122: void llvm::BitstreamWriter::Emit(uint32_t, unsigned int): Assertion `(Val & ~(~0U >> (32-NumBits))) == 0 && "High bits set!"' failed.
Failing Tests (3):
    Clang Tools :: clang-doc/mapper-class-in-function.cpp
    Clang Tools :: clang-doc/mapper-function.cpp
    Clang Tools :: clang-doc/mapper-method.cpp

  Expected Passes    : 6
  Unexpected Failures: 3

At least one failure is because of BoolSize, so i'd suspect the assertion itself is wrong...

juliehockett marked 3 inline comments as done.

Running clang-format and fixing newlines

clang-doc/BitcodeWriter.h
37 ↗(On Diff #136161)

I do, and I've definitely seen that one triggered before but it's been because something was off in how the data was being outputted as I was shifting things around. That said, I'm not seeing it in my local build with this diff though -- I'll update it again just to make sure they're in sync.

Thank you for working on this!
Some more review notes.
Please look into adding a bit more tests.

clang-doc/BitcodeWriter.cpp
196 ↗(On Diff #135682)

Tried locally, and yes, we do need to output record id.

What we could actually do, is simply inline that EmitRecord(), first emitting the RID, and then the name.

template <typename Container>
void EmitRecord(unsigned Code, int ID, const Container &Vals) {
  // If we don't have an abbrev to use, emit this in its fully unabbreviated
  // form.
  auto Count = static_cast<uint32_t>(makeArrayRef(Vals).size());
  EmitCode(bitc::UNABBREV_RECORD);
  EmitVBR(Code, 6);
  EmitVBR(Count + 1, 6); // Including ID
  EmitVBR64(ID, 6); // 'Prefix' with ID
  for (unsigned i = 0, e = Count; i != e; ++i)
    EmitVBR64(Vals[i], 6);
}

But that will result in rather ugly code.
So given that the record names are quite short, and all the other strings we output directly, maybe leave it as it is for now, until it shows in profiles?

179 ↗(On Diff #136303)

Since this is the only string we ever push to Record, can we add an assertion to make sure we always have enough room for it?
E.g.

for (const auto &Init : Inits) {
  RecordId RID = Init.first;
  RecordIdNameMap[RID] = Init.second;
  assert((1 + RecordIdNameMap[RID].size()) <= Record.size());
  // Since record was just created, it should not have any dynamic size.
  // Or move the small size into a variable and use it when declaring the Record and here.
}
230 ↗(On Diff #136303)

Sadly, i can not prove it via godbolt (can't add LLVM as library), but i'd expect streamlining this should at least not hurt, i.e. something like

Record.append(RecordIdNameMap[ID].Name.begin(), RecordIdNameMap[ID].Name.end());

?

clang-doc/BitcodeWriter.h
37 ↗(On Diff #136161)

I did not retry with updated tree/patch, but i'm quite sure i did hit those asserts.
My current build line:

-DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo
-DLLVM_BINUTILS_INCDIR:PATH=/usr/include
-DLLVM_BUILD_TESTS:BOOL=ON
-DLLVM_ENABLE_ASSERTIONS:BOOL=ON
-DLLVM_ENABLE_LLD:BOOL=ON
-DLLVM_ENABLE_PROJECTS:STRING=clang;libcxx;libcxxabi;compiler-rt;lld
-DLLVM_ENABLE_SPHINX:BOOL=ON
-DLLVM_ENABLE_WERROR:BOOL=ON
-DLLVM_PARALLEL_LINK_JOBS:STRING=1
-DLLVM_TARGETS_TO_BUILD:STRING=X86
-DLLVM_USE_SANITIZER:STRING=Address

Additional env variables:

export MALLOC_CHECK_=3
export MALLOC_PERTURB_=$(($RANDOM % 255 + 1))
export ASAN_OPTIONS=abort_on_error=1
export UBSAN_OPTIONS=print_stacktrace=1
226 ↗(On Diff #136303)

Needs a comment about the choice of static size of Record.
I.e. the maximal amount of stuff we expect to push there is recordname string (right now IsDefinition is the longest at 13 chars) + 1 integer.

And add a newline

// Notes
SmallVector<uint32_t, 16> Record;

llvm::BitstreamWriter &Stream;
...
clang-doc/Mapper.cpp
28 ↗(On Diff #136303)

+// If we should ignore this declaration, exit this decl
?

clang-doc/Mapper.h
30 ↗(On Diff #136303)

I wonder if we could reflect the usage of RecursiveASTVisitor in the class name.
Though ClangDocMapperASTVisitor sounds too long?

clang-doc/Representation.h
27 ↗(On Diff #136303)

Is there an intentional decision to minimize sizeof() of these structs?
Many(?) of those could be SmallString's

test/CMakeLists.txt
44

There is are no tests with CommentBlock blocks.

test/clang-doc/mapper-class-in-class.cpp
6 ↗(On Diff #136161)

Ok, so this actually produced c:@S@X.bc and c:@S@X@S@Y.bc.
Please do something like:

// RUN: llvm-bcanalyzer %t/docs/c:@S@X.bc --dump | FileCheck %s --check-prefix CHECK-X
// RUN: llvm-bcanalyzer %t/docs/c:@S@X@S@Y.bc --dump | FileCheck %s --check-prefix CHECK-X-Y

// CHECK-X: <BLOCKINFO_BLOCK/>
// CHECK-X: <VersionBlock NumWords=1 BlockCodeSize=4>
  // CHECK-X: <Version abbrevid=4 op0=1/>
// CHECK-X: </VersionBlock>
// CHECK-X: <RecordBlock NumWords=6 BlockCodeSize=4>
  // CHECK-X: <USR abbrevid=4 op0=6/> blob data = 'c:@S@X'
  // CHECK-X: <Name abbrevid=5 op0=1/> blob data = 'X'
  // CHECK-X: <IsDefinition abbrevid=7 op0=1/>
  // CHECK-X: <TagType abbrevid=10 op0=3/>
// CHECK-X: </RecordBlock>

// CHECK-X-Y: <BLOCKINFO_BLOCK/>
// CHECK-X-Y: <VersionBlock NumWords=1 BlockCodeSize=4>
  // CHECK-X-Y: <Version abbrevid=4 op0=1/>
// CHECK-X-Y: </VersionBlock>
// CHECK-X-Y: <RecordBlock NumWords=11 BlockCodeSize=4>
  // CHECK-X-Y: <USR abbrevid=4 op0=10/> blob data = 'c:@S@X@S@Y'
  // CHECK-X-Y: <Name abbrevid=5 op0=1/> blob data = 'Y'
  // CHECK-X-Y: <Namespace abbrevid=6 op0=1 op1=6/> blob data = 'c:@S@X'
  // CHECK-X-Y: <IsDefinition abbrevid=7 op0=1/>
  // CHECK-X-Y: <TagType abbrevid=10 op0=3/>
// CHECK-X-Y: </RecordBlock>

On a related note, is there any way to auto-generate these CHECK lines?
There is this llvm/utils/update_test_checks.py, but i doubt it will work here.

test/clang-doc/mapper-class-in-function.cpp
8 ↗(On Diff #136161)

Here too, i suppose

test/clang-doc/mapper-enum.cpp
7–8 ↗(On Diff #136303)

Could you please also add a similar enum class test?

17 ↗(On Diff #136303)

Can TypeBlock be on the same depth as VersionBlock?
Via using/typename? If yes, please add such a test.

test/clang-doc/mapper-method.cpp
8 ↗(On Diff #136161)

And here

juliehockett marked 14 inline comments as done.

Fixing comments and adding tests

Thank you for working on this!
Some more nitpicking.

Please consider adding even more tests (ideally, all this code should have 100% test coverage)

clang-doc/BitcodeWriter.cpp
139 ↗(On Diff #136520)

This change is not covered by tests.
(I've actually found out that the hard way, by trying to find why it didn't trigger any asssertions, oh well)

325 ↗(On Diff #136520)

I think it would be cleaner to move it (at least the enterblock, it might make sense to leave the header at the very top) after the static variable

363 ↗(On Diff #136520)

I.e.

...
, FUNCTION_IS_METHOD}}};

  Stream.EnterBlockInfoBlock();
  for (const auto &Block : TheBlocks) {
    assert(Block.second.size() < (1U << BitCodeConstants::SubblockIDSize));
    emitBlockInfo(Block.first, Block.second);
  }
  Stream.ExitBlock();

  emitVersion();
}
clang-doc/BitcodeWriter.h
19 ↗(On Diff #136520)

Please sort includes, clang-tidy complains.

32 ↗(On Diff #136520)
/build/clang-tools-extra/clang-doc/BitcodeWriter.h:32:23: warning: invalid case style for variable 'VERSION_NUMBER' [readability-identifier-naming]
static const unsigned VERSION_NUMBER = 1;
                      ^~~~~~~~~~~~~~
                      VersionNumber
163 ↗(On Diff #136520)

The simplest solution would be

#ifndef NDEBUG // Don't want explicit dtor unless needed
~ClangDocBitcodeWriter() {
  // Check that the static size is large-enough.
  assert(Record.capacity() == BitCodeConstants::RecordSize);
}
#endif
228 ↗(On Diff #136520)

So you want to be really definitive with this. I wanted to avoid that, actually..
Then i'm afraid one more assert is needed, to make sure this is *actually* true.

I'm not seeing any way to make SmallVector completely static,
so you could either add one more wrapper around it (rather ugly),
or check the final size in the ClangDocBitcodeWriter destructor (will not pinpoint when the size has 'overflowed')

246 ↗(On Diff #136520)

Does it *ever* make sense to output BlockInfoBlock anywhere else other than once at the very beginning?
I'd think you should drop the boolean param, and unconditinally call the emitBlockInfoBlock(); from ClangDocBitcodeWriter::ClangDocBitcodeWriter() ctor.

248 ↗(On Diff #136520)

The naming choices confuse me.
There is writeBitstream() and emitBlock(), which is called from writeBitstream() to write the actual contents of the block.

Why one is write and another is emit?
To match the BitstreamWriter naming choices? (which uses Emit prefix)?
To avoid the confusion of which one outputs the actual content, and which one outputs the whole block?

I think it should be:

- void emitBlock(const NamespaceInfo &I);
+ void emitBlockContent(const NamespaceInfo &I);
- void ClangDocBitcodeWriter::writeBitstream(const T &I, bool WriteBlockInfo);
+ void ClangDocBitcodeWriter::emitBlock(const T &I, bool EmitBlockInfo);

This way, *i think* their names would clearner-er state what they do, and won't be weirdly different.
What do you think?

clang-doc/Representation.h
18 ↗(On Diff #136520)

Please sort includes, clang-tidy complains.

clang-doc/Serialize.cpp
88 ↗(On Diff #136520)
/build/clang-tools-extra/clang-doc/Serialize.cpp:88:17: warning: invalid case style for variable 'i' [readability-identifier-naming]
  for (unsigned i = 0, e = C->getNumArgs(); i < e; ++i)
                ^                           ~        ~~
                I                           I        I
/build/clang-tools-extra/clang-doc/Serialize.cpp:88:24: warning: invalid case style for variable 'e' [readability-identifier-naming]
  for (unsigned i = 0, e = C->getNumArgs(); i < e; ++i)
                       ^                        ~~
                       E                        E
107 ↗(On Diff #136520)
/build/clang-tools-extra/clang-doc/Serialize.cpp:107:19: warning: invalid case style for variable 'i' [readability-identifier-naming]
    for (unsigned i = 0, e = C->getDepth(); i < e; ++i)
                  ^                         ~        ~~
                  I                         I        I
/build/clang-tools-extra/clang-doc/Serialize.cpp:107:26: warning: invalid case style for variable 'e' [readability-identifier-naming]
    for (unsigned i = 0, e = C->getDepth(); i < e; ++i)
                         ^                      ~~
                         E                      E
clang-doc/Serialize.h
19 ↗(On Diff #136520)

Please sort includes, clang-tidy complains.

clang-doc/tool/ClangDocMain.cpp
81

Why at the beginning though?
Couldn't the user pass -extra-arg=-fno-parse-all-comments, which could override this?

juliehockett marked 16 inline comments as done.

Adding tests, fixing comments, and removing an (as-of-yet) unused element of the CommentInfo struct.

clang-doc/BitcodeWriter.cpp
139 ↗(On Diff #136520)

So after a some digging, this particular field can't be tested right now as the mapper doesn't look at any TemplateDecls (something that definitely needs to be implemented, but in a follow-on patch). I've removed it for now, until it can be properly used/tested.

196 ↗(On Diff #135682)

If that makes sense to you, sounds good to me!

clang-doc/BitcodeWriter.h
37 ↗(On Diff #136161)

Figured it out -- the Reference struct didn't have default for the enum, and so if it wasn't initialized it was undefined. Should be fixed now.

test/clang-doc/mapper-enum.cpp
17 ↗(On Diff #136303)

Not currently -- I'm planning to add that functionality in the future, but right now it ignores typedef or using decls.

Could some other people please review this differential, too?
I'm sure i have missed things.


Some more nitpicking.

For this differential as standalone, i'we mostly run out of things to nitpick.
Some things can probably be done better (the blockid/recordid stuff could probably be nicer if tablegen-ed, but that is for later).

I'll try to look at the next differential, and at them combined.

clang-doc/BitcodeWriter.cpp
120 ↗(On Diff #136650)

We don't actually push these strings to the Record (but instead output them directly), so this assertion is not really meaningful, i think?

clang-doc/BitcodeWriter.h
21 ↗(On Diff #136650)

+DenseMap

21 ↗(On Diff #136650)

+StringRef

197 ↗(On Diff #136650)

Humm, you could avoid this constant, and conserve a few bits, if you move the init-list out of emitBlockInfoBlock() to somewhere e.g. after the enum RecordId, and then since the BlockId ID is already passed, you could compute it on-the-fly the same way the BitCodeConstants::SubblockIDSize is asserted in emitBlockInfo*().
Not sure if it's worth doing though. Maybe just add it as a NOTE here.

249 ↗(On Diff #136650)

Stale comment

clang-doc/Representation.h
60 ↗(On Diff #136650)

Info *Ref; isn't used anywhere

117 ↗(On Diff #136650)

llvm::Optional<Location> DefLoc; ?

juliehockett marked 11 inline comments as done.

Addressing comments

lebedev.ri added inline comments.Mar 2 2018, 10:38 AM
clang-doc/Representation.h
117 ↗(On Diff #136791)

I meant that IsDefinition controls whether DefLoc will be set/used or not.
So with llvm::Optional<Location> DefLoc, you don't need the bool IsDefinition.

juliehockett marked an inline comment as done.

Removing IsDefinition field.

clang-doc/Representation.h
117 ↗(On Diff #136791)

That...makes so much sense. Oops. Thank you!

Eugene.Zelenko added inline comments.Mar 5 2018, 6:15 PM
clang-doc/BitcodeWriter.h
160 ↗(On Diff #136809)

Looks like Clang-format was applied incorrectly, because this is Google, not LLVM style. Please note that it doesn't modify file, just output formatted code to terminal.

Please reformat other files, including those in dependent patches.

My apologies for getting back on this so late!

So, as an idea (as this diff implements), I updated the string references to be a struct, which holds the USR of the referenced type (for serialization, both here in the mapper and for the dump option in the reducer, as well as a pointer to an Info struct. This pointer is not used at this point, but would be populated by the reducer. Thoughts?

This seems like quite a decent approach! That being said, I don't see the pointer yet? I assume you mean that you will be adding this? Additionally, a slight disadvantage of doing this generic approach is that you need to do bookkeeping on what it is referencing, but I guess there's no helping that due to the architecture which makes you rely upon the USR? Personally I'd prefer having the explicit types if and where possible. So for now a RecordInfo has a vecotr of Reference's to its parents, but we know the parents can only be of certain kinds (more than just a RecordType, but you get the point); it won't be an enum, namespace or function.

As I mentioned, we did this the other way around, which also has the slight advantage that I only had to create and save the USR once per info instance (as in, 10 references to a class only add the overhead of 10 pointers, rather than each having the USR as well), but our disadvantage was of course that we had delayed serialization (although we could arguably do both simultaneously). It seems each method has its merits :).

This seems like quite a decent approach! That being said, I don't see the pointer yet? I assume you mean that you will be adding this? Additionally, a slight disadvantage of doing this generic approach is that you need to do bookkeeping on what it is referencing, but I guess there's no helping that due to the architecture which makes you rely upon the USR? Personally I'd prefer having the explicit types if and where possible. So for now a RecordInfo has a vecotr of Reference's to its parents, but we know the parents can only be of certain kinds (more than just a RecordType, but you get the point); it won't be an enum, namespace or function.

If you take a look at the follow-on patch to this (D43341), you'll see that that is where the pointer is added in (since it is irrelevant to the mapper portion, as it cannot be filled out until the information has been reduced). The back references to children and whatnot are also added there.

As I mentioned, we did this the other way around, which also has the slight advantage that I only had to create and save the USR once per info instance (as in, 10 references to a class only add the overhead of 10 pointers, rather than each having the USR as well), but our disadvantage was of course that we had delayed serialization (although we could arguably do both simultaneously). It seems each method has its merits :).

The USRs are kept for serialization purposes -- given the modular nature of the design, the goal is to be able to write out the bitstream and have it be consumable with all necessary information. Since we can't write out pointers (and it would be useless if we did, since they would change as soon as the file was read in), we maintain the USRs to have a means of re-finding the referenced declaration.

That said, I was looking at the Clangd symbol indexing code yesterday, and noticed that they're hashing the USRs (since they get a little lengthy, particularly when you have nested and/or overloaded functions). I'm going to take a look at that today to try to make the USRs more space-efficient here.

Adding hashing to reduce the size of USRs and updating tests.

Nice!
Some further notes based on the SHA1 nature.

clang-doc/BitcodeWriter.cpp
74 ↗(On Diff #137244)

Those are mixed up.
USRLengthSize is definitively supposed to be second.

81 ↗(On Diff #137244)

The sha1 is all-printable, so how about using BitCodeAbbrevOp::Encoding::Char6 ?
Char4 would work best, but it is not there.

149 ↗(On Diff #137244)

Ha, and all the *_USR are actually StringAbbrev's, not confusing at all :)

309 ↗(On Diff #137244)

Now it would make sense to also assert that this sha1(usr).strlen() == 20

clang-doc/BitcodeWriter.h
46 ↗(On Diff #137244)

Can definitively lower this to 5U (2^6 == 32, which is more than the 20 8-bit chars of sha1)

clang-doc/Representation.h
59 ↗(On Diff #137244)

Now that USR is sha1'd, this is always 20 8-bit characters long.

107 ↗(On Diff #137244)

20
Maybe place using USRString = SmallString<20>; // SHA1 of USR somewhere and use it everywhere?

Athosvk added a comment.EditedMar 7 2018, 3:03 AM

If you take a look at the follow-on patch to this (D43341), you'll see that that is where the pointer is added in (since it is irrelevant to the mapper portion, as it cannot be filled out until the information has been reduced). The back references to children and whatnot are also added there.

Oops! I'll have a look!

The USRs are kept for serialization purposes -- given the modular nature of the design, the goal is to be able to write out the bitstream and have it be consumable with all necessary information. Since we can't write out pointers (and it would be useless if we did, since they would change as soon as the file was read in), we maintain the USRs to have a means of re-finding the referenced declaration.

What I was referring to was the storing of a USR per reference. Of course, serializing pointers wouldn't work, but what I mean is that what we used as a USR was stored in what was pointed to, not in the reference that tells what we are pointing to. To be a little more concise, a RecordInfo has pointers to the FuntionInfo for its member functions. Upon serialization, the RecordInfo queries the USR of those functions. A function being referenced multiple times remains to only have the USR stored. If I understand correctly, you currently save the USR for time an InfoType references another InfoType.

Anyhow, don't pay too much attention to that comment, it's all meant as a minor thing. It sure is looking good so far!

Some further notes based on the SHA1 nature.

I'm sorry, brainfreeze, i meant 40 chars, not 20.
Updated comments...

clang-doc/BitcodeWriter.cpp
309 ↗(On Diff #137244)

40 that is

clang-doc/BitcodeWriter.h
46 ↗(On Diff #137244)

Edit: to 6U (2^6 == 64, which is more than the 40 8-bit chars of sha1)

clang-doc/Representation.h
59 ↗(On Diff #137244)

40 that is

107 ↗(On Diff #137244)

40

juliehockett marked 13 inline comments as done.

Updating bitcode writer for hashed USRs, and re-running clang-format. Also cleaning up a couple of unused fields.

Hmm, i'm missing something about the way store sha1...

clang-doc/BitcodeWriter.cpp
53 ↗(On Diff #137457)

This is VBR because USRLengthSize is of such strange size, to conserve the bits?

57 ↗(On Diff #137457)

Looking at the NumWords changes (decrease!) in the tests, and this is bugging me.
And now that i have realized what we do with USR:

  • we first compute SHA1, and get 20x uint8_t
  • store/use it internally
  • then hex-ify it, getting 40x char (assuming 8-bit char)
  • then convert to char6, winning back two bits. but we still loose 2 bits.

Question: *why* do we store sha1 of USR as a string?
Why can't we just store that USRString (aka USRSha1 binary) directly?
That would be just 20 bytes, you just couldn't go any lower than that.

clang-doc/Representation.h
29 ↗(On Diff #137457)

Right, of course, internally this is kept in the binary format, which is just 20 chars.
This is not the string (the hex-ified version of sha1), but the raw sha1, the binary.
This should somehow convey that. This should be something closer to USRSha1.

sammccall accepted this revision.Mar 8 2018, 4:51 PM

There's a few places where we can trim some of the boilerplate, which I think is important - it's hard to find the "real code" among all the plumbing in places.
Other than that, this seems OK to me.

clang-doc/BitcodeWriter.h
116 ↗(On Diff #137457)

I think you don't want to declare ID in the unspecialized template, so you get a compile error if you try to use it.

(Using traits for this sort of thing seems a bit overboard to me, but YMMV)

154 ↗(On Diff #137457)

Hmm, you spend a lot of effort plumbing this variable around! Why is it so important?
Filesize? (I'm not that familiar with LLVM bitcode, but surely we'll end up with a string table anyway?)

If it really is an important option people will want, the command-line arg should probably say why.

241 ↗(On Diff #137457)

OK, I don't get this at all.

We have to declare emitBlockContent(NamespaceInfo) *and* the specialization of MapFromInfoToBlockId<NamespaceInfo>, and deal with the public interface emitBlock being a template function where you can't tell what's legal to pass, instead of writing:

void emitBlock(const NamespaceInfo &I) {
  SubStreamBlockGuard Block(Stream, BI_NAMESPACE_BLOCK_ID); // <-- this one line
  ...
}

This really seems like templates for the sake of templates :(

clang-doc/ClangDoc.h
11

This comment doesn't seem accurate - there's no main() in this file.
There's a FrontendActionFactory, but nothing in this file uses it.

38

nit: seems odd to put all this implementation in the header.
(personally I'd just expose a function returning unique_ptr<FrontendActionFactory> from the header, but up to you...)

39

for ASTConsumers implemented by ASTVisitors, there seems a fairly strong convention to just make the same class extend both (MapASTVisitor, here).
That would eliminate one plumbing class...

clang-doc/Mapper.cpp
33 ↗(On Diff #137457)

It seems a bit of a poor fit to use a complete bitcode file (header, version, block info) as your value format when you know the format, and know there'll be no version skew.
Is it easy just to emit the block we care about?

clang-doc/Representation.h
29 ↗(On Diff #137457)

I'm not sure that any of the implementation (either USR or SHA) belongs in the type name.
In clangd we called this type SymbolID, which seems like a reasonable name here too.

44 ↗(On Diff #137457)

this is probably the right place to document these fields - what are the legal kinds? what's the name of a comment, direction, etc?

This revision is now accepted and ready to land.Mar 8 2018, 4:51 PM
This revision was automatically updated to reflect the committed changes.
juliehockett marked 11 inline comments as done.

Might have been better to not start landing until the all differentials are understood/accepted, but i understand that it is not really up to me to decide.
Let's hope nothing in the next differentials will require changes to this initial code :)

clang-doc/BitcodeWriter.h
241 ↗(On Diff #137457)

If you want to add a new block, in one case you just need to add one

template <> struct MapFromInfoToBlockId<???Info> {
  static const BlockId ID = BI_???_BLOCK_ID;
};

In the other case you need to add whole

void ClangDocBitcodeWriter::emitBlock(const ???Info &I) {
  StreamSubBlockGuard Block(Stream, BI_???_BLOCK_ID);
  emitBlockContent(I);
}

(and it was even longer initially)
It seems just templating one static variable is shorter than duplicating emitBlock() each time, no?

Do compare the current diff with the original diff state.
I *think* these templates helped move much of the duplication to simplify the code overall.

Since the commit was reverted, did you mean to either recommit it, or reopen this (with updated diff), so it does not get lost?

Since the commit was reverted, did you mean to either recommit it, or reopen this (with updated diff), so it does not get lost?

Relanded in r327295.

clang-doc/BitcodeWriter.h
154 ↗(On Diff #137457)

It was for testing purposes (so that the tests aren't flaky on filenames), but I replaced it with regex.

241 ↗(On Diff #137457)

You'd still have to add the appropriate emitBlock() function for any new block, since it would have different attributes.

clang-doc/Mapper.cpp
33 ↗(On Diff #137457)

Ideally, yes, but right now in the clang BitstreamWriter there's no way to tell the instance what all the abbreviations are without also emitting the blockinfo to the output stream, though I'm thinking about taking a stab at separating the two.

Also, this relies on the llvm-bcanalyzer for testing, which requires both the header and the blockinfo in order to read the data :/

lebedev.ri added inline comments.Mar 14 2018, 1:44 PM
clang-doc/BitcodeWriter.cpp
230 ↗(On Diff #136303)

And https://github.com/mattgodbolt/compiler-explorer/issues/841 is done,
so now we can see that SmallVector::append() at least results in less code:
https://godbolt.org/g/xJQ59c

So what part is failing, specifically?
The SHA1 blobs of USR's differ in the llvm-bcanalyzer dumps?
The actual filenames %t/docs/bc/<sha1-to-text> differ?
I guess both?

First one you should be able to handle by replacing the actual values with a regex
(i'd guess <USR abbrevid=4 op0=20 op1=11 <...> op19=226 op20=232/> -> <USR abbrevid=4 .*/>, but did not try)
I'm not sure we care about the actual values here, do we?

Second one is interesting.
If we assume that the order in which those are generated is the same, which i think is a safer assumption,
then you could just use result id, not key (sha1-to-text of USR), i.e. %t/docs/bc/00.bc, %t/docs/bc/01.bc and so on.
I.e. something like:

  if (DumpMapperResult) {
+   unsigned id = 0;
    Exec->get()->getToolResults()->forEachResult([&](StringRef Key,
                                                     StringRef Value) {
      SmallString<128> IRRootPath;
      llvm::sys::path::native(OutDirectory, IRRootPath);
      llvm::sys::path::append(IRRootPath, "bc");
      std::error_code DirectoryStatus =
          llvm::sys::fs::create_directories(IRRootPath);
      if (DirectoryStatus != OK) {
        llvm::errs() << "Unable to create documentation directories.\n";
        return;
      }
-     llvm::sys::path::append(IRRootPath, Key + ".bc");
+     llvm::sys::path::append(IRRootPath, std::to_string(id) + ".bc");
      std::error_code OutErrorInfo;
      llvm::raw_fd_ostream OS(IRRootPath, OutErrorInfo, llvm::sys::fs::F_None);
      if (OutErrorInfo != OK) {
        llvm::errs() << "Error opening documentation file.\n";
        return;
      }
      OS << Value;
      OS.close();
+     id++;
    });
  }

Hm, or possibly you could just pass the triple to clang?

I was just thinking of disabling the one test that has an issue (class-in-function) on Windows -- the filename is only used in generating *some* USRs, so all of the other ones are fine. We ran into some issues with that though, since UNSUPPORTED: system-windows didn't seem to disable the test on the machine I have access to. Thoughts?

I was just thinking of disabling the one test that has an issue (class-in-function) on Windows -- the filename is only used in generating *some* USRs, so all of the other ones are fine. We ran into some issues with that though, since UNSUPPORTED: system-windows didn't seem to disable the test on the machine I have access to. Thoughts?

UNSUPPORTED: system-windows

Perhaps that is only for msvc?

Have you tried something more broad, like
UNSUPPORTED: mingw32,win32
?

Have you tried something more broad, like
UNSUPPORTED: mingw32,win32
?

That wasn't working either, confusingly, at least on the local windows machine I have.

Huh, something weird is going on there.
What about the other way around, REQUIRES: linux ?

After much digging, it looks like the lit config is never initialized in clang-tools-extra like it is in the other projects. REQUIRES et.al. work properly once that's in there (see D44708). Once that lands I'll reland this and *hopefully* that'll be that!